RE: Example Data Modelling

2015-07-07 Thread Peer, Oded
The data model suggested isn’t optimal for the “end of month” query you want to 
run since you are not querying by partition key.
The query would look like “select EmpID, FN, LN, basic from salaries where 
month = 1” which requires filtering and has unpredictable performance.

For this type of query to be fast you can use the “month” column as the
partition key and the “EmpID” as the clustering column.
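
Roughly, that alternative model would look like this (the table name
salaries_by_month is illustrative, not part of the original thread):

CREATE TABLE salaries_by_month (
  month int,
  EmpID varchar,
  FN varchar,
  LN varchar,
  basic int,
  flexible_allowance float,
  PRIMARY KEY (month, EmpID)
);

-- the end-of-month query now reads a single partition:
SELECT EmpID, FN, LN, basic FROM salaries_by_month WHERE month = 1;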
This approach also has drawbacks:
1. This data model creates a wide row. Depending on the number of employees 
this partition might be very large. You should limit partition sizes to 25MB
2. Distributing data according to month means that only a small number of nodes 
will hold all of the salary data for a specific month which might cause 
hotspots on those nodes.

Choose the approach that works best for you.


From: Carlos Alonso [mailto:i...@mrcalonso.com]
Sent: Monday, July 06, 2015 7:04 PM
To: user@cassandra.apache.org
Subject: Re: Example Data Modelling

Hi Srinivasa,

I think you're right. In Cassandra you should favor denormalisation when in
an RDBMS you find a relationship like this.

I'd suggest a cf like this
CREATE TABLE salaries (
  EmpID varchar,
  FN varchar,
  LN varchar,
  Phone varchar,
  Address varchar,
  month int,
  basic int,
  flexible_allowance float,
  PRIMARY KEY(EmpID, month)
)

That way the salaries will be partitioned by EmpID and clustered by month, 
which I guess is the natural sorting you want.

Hope it helps,
Cheers!

Carlos Alonso | Software Engineer | @calonso | https://twitter.com/calonso

On 6 July 2015 at 13:01, Srinivasa T N 
seen...@gmail.com wrote:
Hi,
   I have basic doubt: I have an RDBMS with the following two tables:

   Emp - EmpID, FN, LN, Phone, Address
   Sal - Month, Empid, Basic, Flexible Allowance

   My use case is to print the Salary slip at the end of each month and the 
slip contains emp name and his other details.

   Now, if I want to have the same in cassandra, I will have a single cf with 
emp personal details and his salary details.  Is this the right approach?  
Should we have the employee personal details duplicated each month?

Regards,
Seenu.



Re: Experiencing Timeouts on one node

2015-07-07 Thread Jason Wee
3. How do we rebuild System keyspace?

wipe this node and start it all over.

hth

jason

On Tue, Jul 7, 2015 at 12:16 AM, Shashi Yachavaram shashi...@gmail.com
wrote:

 When we reboot the problematic node, we see the following errors in
 system.log.

 1. Does this mean hints column family is corrupted?
 2. Can we scrub system column family on problematic node and its
 replication partners?
 3. How do we rebuild System keyspace?

 ==
 ERROR [CompactionExecutor:950] 2015-06-27 20:11:44,595
 CassandraDaemon.java (line 191) Exception in thread
 Thread[CompactionExecutor:950,1,main]
 java.lang.AssertionError: originally calculated column size of 8684 but
 now it is 15725
 at
 org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:135)
 at
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
 at
 org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
 at
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at
 org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
 at
 org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
 at
 org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(CompactionManager.java:442)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
 ERROR [HintedHandoff:552] 2015-06-27 20:11:44,595 CassandraDaemon.java
 (line 191) Exception in thread Thread[HintedHandoff:552,1,main]
 java.lang.RuntimeException: java.util.concurrent.ExecutionException:
 java.lang.AssertionError: originally calculated column size of 8684 but now
 it is 15725
 at
 org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:436)
 at
 org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:282)
 at
 org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:90)
 at
 org.apache.cassandra.db.HintedHandOffManager$4.run(HintedHandOffManager.java:502)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
 Caused by: java.util.concurrent.ExecutionException:
 java.lang.AssertionError: originally calculated column size of 8684 but now
 it is 15725
 at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
 at java.util.concurrent.FutureTask.get(Unknown Source)
 at
 org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:432)
 ... 6 more
 Caused by: java.lang.AssertionError: originally calculated column size of
 8684 but now it is 15725
 at
 org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:135)
 at
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
 at
 org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
 at
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at
 org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
 at
 org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
 at
 org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(CompactionManager.java:442)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 ==


 On Wed, Jul 1, 2015 at 11:59 AM, Shashi Yachavaram shashi...@gmail.com
 wrote:

 We have a 28 node cluster, out of which only one node is experiencing
 timeouts.
 We thought it was the raid, but there are two other nodes on the same raid
 without any problem. Also, the problem goes away if we reboot the node, and
 then reappears after seven days. The following hinted hand-off timeouts are
 seen on the node experiencing the timeouts. Also, we did not notice any
 gossip errors.

 I was wondering if anyone has seen this issue and how they resolved it.

 Cassandra Version: 1.2.15.1
 OS: Linux cm 2.6.32-504.8.1.el6.x86_64 #1 SMP Fri Dec 19 

Re: Example Data Modelling

2015-07-07 Thread Srinivasa T N
Thanks for the inputs.

Now my question is how the app should populate the duplicate data, i.e., if
I have an employee record (along with his FN, LN, ...) for the month of April
and later I am populating the same record for the month of May (with salary
changed), should my application first read/fetch the corresponding data for
April and re-insert it with modifications for May?

Regards,
Seenu.

On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query you
 want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries where
 month = 1” which requires filtering and has unpredictable performance.



 For this type of query to be fast you can use the “month” column as the
 partition key and the “EmpID” as the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number of
 nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right. In Cassandra you should favor denormalisation when
 in an RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month int,

   basic int,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by month,
 which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single cf
 with emp personal details and his salary details.  Is this the right
 approach?  Should we have the employee personal details duplicated each
 month?

 Regards,
 Seenu.





Re: Example Data Modelling

2015-07-07 Thread Carlos Alonso
I guess you're right, using my proposal, getting an employee's latest record is
straightforward and quick, but also, as Peer pointed out, getting all slips
for a particular month requires you to know all the employee IDs and,
ideally, run a query for each employee. This would work depending on how
many employees you're managing.

At this moment I'm beginning to feel that maybe using both approaches is
the best way to go. And I think this is one of Cassandra's recommendations:
Write your data in several formats if required to fit your reads. Therefore
I'd use my suggestion for getting a salary by employee ID and I'd also have
Peer's one to run the end of the month query.
Does it make sense?
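
A minimal sketch of the dual write, assuming the month-keyed table from
Peer's suggestion is named salaries_by_month (the name and the literal
values are illustrative):

BEGIN BATCH
  INSERT INTO salaries (EmpID, FN, LN, Phone, Address, month, basic, flexible_allowance)
  VALUES ('E001', 'John', 'Doe', '555-0100', '1 Main St', 5, 50000, 1200.0);
  INSERT INTO salaries_by_month (month, EmpID, FN, LN, basic, flexible_allowance)
  VALUES (5, 'E001', 'John', 'Doe', 50000, 1200.0);
APPLY BATCH;

A logged batch like this keeps the two tables from drifting apart if a
write fails part-way, at the cost of some extra coordination overhead.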

Cheers!

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is how the app should populate the duplicate data, i.e.,
 if I have an employee record (along with his FN, LN, ...) for the month of
 April and later I am populating the same record for the month of May (with
 salary changed), should my application first read/fetch the corresponding
 data for April and re-insert it with modifications for May?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query you
 want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable performance.



 For this type of query to be fast you can use the “month” column as the
 partition key and the “EmpID” as the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number of
 nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right. In Cassandra you should favor denormalisation when
 in an RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month int,

   basic int,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single cf
 with emp personal details and his salary details.  Is this the right
 approach?  Should we have the employee personal details duplicated each
 month?

 Regards,
 Seenu.







Re: Is there a way to remove a node with Opscenter?

2015-07-07 Thread Sid Tantia
Thanks for the response. I’m trying to remove a node that’s already down for
some reason so it’s not allowing me to decommission it. Is there some other way
to do this?

On Tue, Jul 7, 2015 at 12:45 PM, Kiran mk coolkiran2...@gmail.com wrote:

 Yes, if your intention is to decommission a node. You can do that by
 clicking on the node and choosing decommission.
 Best Regards,
 Kiran.M.K.
 On Jul 8, 2015 1:00 AM, Sid Tantia sid.tan...@baseboxsoftware.com wrote:
  I know you can use `nodetool removenode` from the command line but is
 there a way to remove a node from a cluster using OpsCenter?



Re: Question on 'Average tombstones per slice' when running cfstats

2015-07-07 Thread Robert Coli
On Mon, Jul 6, 2015 at 5:38 PM, Jonathan Haddad j...@jonhaddad.com wrote:

 Wouldn't it suggest a delete heavy workload, rather than update?


I consider DELETE a case of UPDATE, but sure, you are correct. :D

=Rob


Is there a way to remove a node with Opscenter?

2015-07-07 Thread Sid Tantia
I know you can use `nodetool removenode` from the command line but is there a 
way to remove a node from a cluster using OpsCenter?

Re: Is there a way to remove a node with Opscenter?

2015-07-07 Thread Kiran mk
Yes, if your intention is to decommission a node. You can do that by
clicking on the node and choosing decommission.

Best Regards,
Kiran.M.K.
On Jul 8, 2015 1:00 AM, Sid Tantia sid.tan...@baseboxsoftware.com wrote:

  I know you can use `nodetool removenode` from the command line but is
 there a way to remove a node from a cluster using OpsCenter?




Re: Is there a way to remove a node with Opscenter?

2015-07-07 Thread Surbhi Gupta
If the node is down, use:

nodetool removenode <Host ID>

Run this command when the node is down. If the cluster does not use vnodes,
adjust the tokens before running the nodetool removenode command.

If the node is up, then the command would be “nodetool decommission” to
remove the node.

Remove the node from the “seed list” within the configuration
cassandra.yaml.

On 7 July 2015 at 12:56, Sid Tantia sid.tan...@baseboxsoftware.com wrote:

 Thanks for the response. I’m trying to remove a node that’s already down
 for some reason so it’s not allowing me to decommission it. Is there some
 other way to do this?



 On Tue, Jul 7, 2015 at 12:45 PM, Kiran mk coolkiran2...@gmail.com wrote:

 Yes, if your intention is to decommission a node. You can do that by
 clicking on the node and choosing decommission.

 Best Regards,
 Kiran.M.K.
 On Jul 8, 2015 1:00 AM, Sid Tantia sid.tan...@baseboxsoftware.com
 wrote:

  I know you can use `nodetool removenode` from the command line but is
 there a way to remove a node from a cluster using OpsCenter?





Re: Is there a way to remove a node with Opscenter?

2015-07-07 Thread Sid Tantia
I tried both `nodetool removenode <Host ID>` and `nodetool decommission` and
they both give the error:

nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection
refused'.

Here is what I have tried to fix this:

1) Uncommented JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<public name>"
2) Changed rpc_address to 0.0.0.0
3) Restarted cassandra
4) Restarted datastax-agent

(Note that I installed my cluster using OpsCenter, so that may have something
to do with it?)

On Tue, Jul 7, 2015 at 2:08 PM, Surbhi Gupta surbhi.gupt...@gmail.com
wrote:

 If the node is down, use:
 nodetool removenode <Host ID>
 Run this command when the node is down. If the cluster does not use vnodes,
 adjust the tokens before running the nodetool removenode command.
 If the node is up, then the command would be “nodetool decommission” to
 remove the node.
 Remove the node from the “seed list” within the configuration
 cassandra.yaml.
 On 7 July 2015 at 12:56, Sid Tantia sid.tan...@baseboxsoftware.com wrote:
 Thanks for the response. I’m trying to remove a node that’s already down
 for some reason so it’s not allowing me to decommission it. Is there some
 other way to do this?



 On Tue, Jul 7, 2015 at 12:45 PM, Kiran mk coolkiran2...@gmail.com wrote:

 Yes, if your intention is to decommission a node. You can do that by
 clicking on the node and choosing decommission.

 Best Regards,
 Kiran.M.K.
 On Jul 8, 2015 1:00 AM, Sid Tantia sid.tan...@baseboxsoftware.com
 wrote:

  I know you can use `nodetool removenode` from the command line but is
 there a way to remove a node from a cluster using OpsCenter?




Re: Is there a way to remove a node with Opscenter?

2015-07-07 Thread Robert Coli
On Tue, Jul 7, 2015 at 4:39 PM, Sid Tantia sid.tan...@baseboxsoftware.com
wrote:

 I tried both `nodetool removenode <Host ID>` and `nodetool decommission`
 and they both give the error:

  nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException:
 'Connection refused'.

 Here is what I have tried to fix this:


Instead of that stuff, why not :

1) use lsof to determine what IP is being listened to on 7199 by the
running process?
2) connect to that IP

?

=Rob


DTCS dropping of SST Tables

2015-07-07 Thread Anishek Agarwal
Hey all,

We are using DTCS and we have a TTL of 30 days for all inserts; there are
no deletes or updates.
When an SSTable is dropped by DTCS, what kind of logging do we see in the C*
logs?

Any help would be useful. The reason I ask is that my DB size is not hovering
around a steady size; it is increasing, and there has been no significant
change in the traffic that creates data in C*.

thanks
anishek


Re: Example Data Modelling

2015-07-07 Thread Rory Bramwell, DevOp Services
Hi,

I've been following this thread and my thoughts are in line with Carlos'
latest response... Model your data to suit your queries. That is one of
the data model / design considerations in Cassandra that differs from the
RDBMS world. Embrace denormalization and data duplication. Disk space is
cheap, so exploit how your data is laid out in order to optimize for
faster reads (which are more costly than writes).

Regards,

Rory Bramwell
Founder and CEO
DevOp Services

Skype: devopservices
Email: rory.bramw...@devopservices.com
Web: www.devopservices.com
On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote:

 I guess you're right, using my proposal, getting an employee's latest record is
 straightforward and quick, but also, as Peer pointed out, getting all slips
 for a particular month requires you to know all the employee IDs and,
 ideally, run a query for each employee. This would work depending on how
 many employees you're managing.

 At this moment I'm beginning to feel that maybe using both approaches is
 the best way to go. And I think this is one of Cassandra's recommendations:
 Write your data in several formats if required to fit your reads. Therefore
 I'd use my suggestion for getting a salary by employee ID and I'd also have
 Peer's one to run the end of the month query.
 Does it make sense?

 Cheers!

 Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

 On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is how the app should populate the duplicate data, i.e.,
 if I have an employee record (along with his FN, LN, ...) for the month of
 April and later I am populating the same record for the month of May (with
 salary changed), should my application first read/fetch the corresponding
 data for April and re-insert it with modifications for May?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query
 you want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable performance.



 For this type of query to be fast you can use the “month” column as
 the partition key and the “EmpID” as the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number
 of nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right. In Cassandra you should favor denormalisation
 when in an RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month int,

   basic int,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single cf
 with emp personal details and his salary details.  Is this the right
 approach?  Should we have the employee personal details duplicated each
 month?

 Regards,
 Seenu.








Re: Compaction issues, 2.0.12

2015-07-07 Thread Jeff Williams
Thanks Rob, Jeff. I have updated the Jira issue with my information.

On 6 July 2015 at 23:46, Jeff Ferland j...@tubularlabs.com wrote:

 I’ve seen the same thing:
 https://issues.apache.org/jira/browse/CASSANDRA-9577

 I’ve had cases where a restart clears the old tables, and I’ve had cases
 where a restart considers the old tables to be live.

 On Jul 6, 2015, at 1:51 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Jul 6, 2015 at 1:46 PM, Jeff Williams je...@wherethebitsroam.com
 wrote:

 1) Cassandra version is 2.0.12.



 2) Interesting. Looking at JMX org.apache.cassandra.db - ColumnFamilies
 - trackcontent - track_content - Attributes, I get:

 LiveDiskSpaceUsed: 17788740448, i.e. ~17GB
 LiveSSTableCount: 3
 TotalDiskSpaceUsed: 55714084629, i.e. ~55GB

 So it obviously knows about the extra disk space even though the live
 space looks correct. I couldn't find anything to identify the actual files
 though.


 That's what I would expect.


 3) So that was even more interesting. After restarting the cassandra
 daemon, the sstables were not deleted and now the same JMX attributes are:

 LiveDiskSpaceUsed: 55681040579, i.e. ~55GB
 LiveSSTableCount: 8
 TotalDiskSpaceUsed: 55681040579, i.e. ~55GB

 So some of my non-live tables are back live again, and obviously some of
 the big ones!!


 This is permanently fatal to consistency; sorry if I was not clear enough
 that if they were not live, there was some risk of Cassandra considering
 them live again upon restart.

 If I were you, I would either stop the node and remove the files you know
 shouldn't be live or do a major compaction ASAP.

 The behavior you just encountered sounds like a bug, and it is a rather
 serious one. SSTables which should be dead being marked live is very bad
 for consistency.

 Do you see any exceptions in your logs or anything? If you can repro, you
 should file a JIRA ticket with the apache project...

 =Rob





Re: Is there a way to remove a node with Opscenter?

2015-07-07 Thread Michael Shuler

On 07/07/2015 07:27 PM, Robert Coli wrote:

On Tue, Jul 7, 2015 at 4:39 PM, Sid Tantia
sid.tan...@baseboxsoftware.com
wrote:

I tried both `nodetool removenode <Host ID>` and `nodetool
decommission` and they both give the error:

nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException:
'Connection refused'.

Here is what I have tried to fix this:


Instead of that stuff, why not :

1) use lsof to determine what IP is being listened to on 7199 by the
running process?
2) connect to that IP


OP said node was already down/dead:

Don't forget the hierarchy of node removal in #cassandra: decommission, 
removenode, removenode force, assassinate.  Escalate in that order.


https://twitter.com/faltering/status/559845791741657088

:)
Michael



Consistent reads and first write wins

2015-07-07 Thread John Sanda
Suppose I have the following schema,

CREATE TABLE foo (
  id text,
  time timeuuid,
  prop1 text,
  PRIMARY KEY (id, time)
)
WITH CLUSTERING ORDER BY (time ASC);

And I have two clients who execute quorum writes, e.g.,

// client 1
INSERT INTO FOO (id, time, prop1) VALUES ('test', time_uuid_1, 'bar');

// client 2
INSERT INTO FOO (id, time, prop1) VALUES ('test', time_uuid_2, 'bam');

If time_uuid_1 comes before time_uuid_2 and if both clients follow up the
writes with quorum reads, then will both clients see the value 'bar' for
prop1? Are there situations in which clients might see different values?
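
For concreteness, since time is part of the primary key the two INSERTs
above create two distinct rows rather than overwriting each other; a
minimal sketch of a read that picks out the earliest row (the 'first
write' in the sense above):

-- rows are clustered by time ASC, so LIMIT 1 returns the earliest one
SELECT time, prop1 FROM foo WHERE id = 'test' LIMIT 1;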


-- 

- John


How to export query results (millions of rows) as CSV format?

2015-07-07 Thread shahab
Hi,

Is there any way to export the results of a query (e.g. select * from tbl1
where id = aa and loc = bb) into a file as CSV format?

I tried to use the COPY command with cqlsh, but the command does not work
when you have a WHERE condition.

Does anyone have an idea how to do this?

best,
/Shahab


Re: Example Data Modelling

2015-07-07 Thread Carlos Alonso
Hi Jerome,

Good point!! Really a nice usage of static columns! BTW, wouldn't the EmpID
be static as well?

Cheers

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 7 July 2015 at 14:42, Jérôme Mainaud jer...@mainaud.com wrote:

 Hello,

 You can slightly adapt Carlos' answer to reduce replication of data that
 doesn't change from month to month.
 Static columns are great for this.

 The table becomes:

 CREATE TABLE salaries (
   EmpID varchar,
   FN varchar static,
   LN varchar static,
   Phone varchar static,
   Address varchar static,
   month int,
   basic int,
   flexible_allowance float,
   PRIMARY KEY(EmpID, month)
 )

 There is only one copy of a static column per partition; the value is shared
 between all rows of the partition.
 When employee data changes, you can update it with the partition key in the
 WHERE clause.
 When you insert a new month entry you just fill the non-static columns.
 The table can be queried the same way as the original one.

 Cheers



 --
 Jérôme Mainaud
 jer...@mainaud.com

 2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services 
 rory.bramw...@devopservices.com:

 Hi,

 I've been following this thread and my thoughts are in line with Carlos'
 latest response... Model your data to suit your queries. That is one of
 the data model / design considerations in Cassandra that differs from the
 RDBMS world. Embrace denormalization and data duplication. Disk space is
 cheap, so exploit how your data is laid out in order to optimize for
 faster reads (which are more costly than writes).

 Regards,

 Rory Bramwell
 Founder and CEO
 DevOp Services

 Skype: devopservices
 Email: rory.bramw...@devopservices.com
 Web: www.devopservices.com
 On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote:

 I guess you're right, using my proposal, getting an employee's latest record
 is straightforward and quick, but also, as Peer pointed out, getting all slips
 for a particular month requires you to know all the employee IDs and,
 ideally, run a query for each employee. This would work depending on how
 many employees you're managing.

 At this moment I'm beginning to feel that maybe using both approaches is
 the best way to go. And I think this is one of Cassandra's recommendations:
 Write your data in several formats if required to fit your reads. Therefore
 I'd use my suggestion for getting a salary by employee ID and I'd also have
 Peer's one to run the end of the month query.
 Does it make sense?

 Cheers!

 Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso

 On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is how the app should populate the duplicate data,
 i.e., if I have an employee record (along with his FN, LN, ...) for the month
 of April and later I am populating the same record for the month of May (with
 salary changed), should my application first read/fetch the corresponding
 data for April and re-insert it with modifications for May?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query
 you want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable 
 performance.



 For this type of query to be fast you can use the “month” column as
 the partition key and the “EmpID” as the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number
 of nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right. In Cassandra you should favor denormalisation
 when in an RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month int,

   basic int,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

   

Re: Example Data Modelling

2015-07-07 Thread John Sanda
25 MB seems very specific. Is there a reason why?

On Tuesday, July 7, 2015, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query you
 want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries where
 month = 1” which requires filtering and has unpredictable performance.



 For this type of query to be fast you can use the “month” column as the
 partition key and the “EmpID” as the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number of
 nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right. In Cassandra you should favor denormalisation when
 in an RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month int,

   basic int,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by month,
 which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single cf
 with emp personal details and his salary details.  Is this the right
 approach?  Should we have the employee personal details duplicated each
 month?

 Regards,
 Seenu.





-- 

- John


Re: Example Data Modelling

2015-07-07 Thread Jérôme Mainaud
Hello,

You can slightly adapt Carlos' answer to reduce replication of data that
doesn't change from month to month.
Static columns are great for this.

The table becomes:

CREATE TABLE salaries (
  EmpID varchar,
  FN varchar static,
  LN varchar static,
  Phone varchar static,
  Address varchar static,
  month int,
  basic int,
  flexible_allowance float,
  PRIMARY KEY(EmpID, month)
)

There is only one copy of a static column per partition; the value is shared
between all rows of the partition.
When employee data changes, you can update it with the partition key in the
WHERE clause.
When you insert a new month entry you just fill the non-static columns.
The table can be queried the same way as the original one.
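
A minimal sketch of those two kinds of writes (the literal values are
illustrative):

-- employee details changed: update the static columns once per partition
UPDATE salaries SET Phone = '555-0100', Address = '2 New St'
WHERE EmpID = 'E001';

-- new month entry: fill only the non-static columns
INSERT INTO salaries (EmpID, month, basic, flexible_allowance)
VALUES ('E001', 5, 52000, 1300.0);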

Cheers



-- 
Jérôme Mainaud
jer...@mainaud.com

2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services 
rory.bramw...@devopservices.com:

 Hi,

 I've been following this thread and my thoughts are in line with Carlos'
 latest response... Model your data to suit your queries. That is one of
 the data model / design considerations in Cassandra that differs from the
 RDBMS world. Embrace denormalization and data duplication. Disk space is
 cheap, so exploit how your data is laid out in order to optimize for
 faster reads (which are more costly than writes).

 Regards,

 Rory Bramwell
 Founder and CEO
 DevOp Services

 Skype: devopservices
 Email: rory.bramw...@devopservices.com
 Web: www.devopservices.com
 On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote:

 I guess you're right, using my proposal, getting an employee's latest record
 is straightforward and quick, but also, as Peer pointed out, getting all slips
 for a particular month requires you to know all the employee IDs and,
 ideally, run a query for each employee. This would work depending on how
 many employees you're managing.

 At this moment I'm beginning to feel that maybe using both approaches is
 the best way to go. And I think this is one of Cassandra's recommendations:
 Write your data in several formats if required to fit your reads. Therefore
 I'd use my suggestion for getting a salary by employee ID and I'd also have
 Peer's one to run the end of the month query.
 Does it make sense?

 Cheers!

 Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso

 On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is how the app should populate the duplicate data, i.e.,
 if I have an employee record (along with his FN, LN, ...) for the month of
 April and later I am populating the same record for the month of May (with
 salary changed), should my application first read/fetch the corresponding
 data for April and re-insert it with modifications for May?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query
 you want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable 
 performance.



 For this type of query to be fast you can use the “month” column as the
 partition key and the “EmpID” as the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number
 of nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right. In Cassandra you should favor denormalisation
 when in an RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month int,

   basic int,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single
 cf with emp personal details and his salary details.  Is this the right
 approach?  

[ANNOUNCE] YCSB 0.2.0 Release

2015-07-07 Thread Sean Busbey
On behalf of the development community, I am pleased to announce the
release of YCSB version 0.2.0.

Highlights:

* Apache Cassandra 2.0 CQL support
* Apache HBase 1.0 support
* Apache Accumulo 1.6 support
* MongoDB - support for all production versions released since 2011
* Tarantool 1.6 support
* ~5 additional datastore bindings in experimental status
* Optional support for latency collection via HdrHistogram
* Optional support for fixing coordinated omission

Full release notes, including links to source and convenience binaries:

https://github.com/brianfrankcooper/YCSB/releases/tag/0.2.0

This release is the first from the project in 3.5 years, so I'd recommend
reading the release notes if you're a user.

The project is moving to a monthly release cadence, so hopefully future
releases will be easier to incrementally consume.

-- 
Sean


Streaming data to Cassandra with Hadoop

2015-07-07 Thread Elad Efrat
Hello,

I'm loading data from HDFS to Cassandra using Spotify's hdfs2cass. The
setup is a 4-node cluster running Cassandra 2.1.6, RF=2, STCS, raw
data size is about 1tb before loading and 3.8tb after. The process
works fine, but I do have a few questions.

1. Some Hadoop jobs fail due to streaming timeouts. That's fine,
because subsequent attempts succeed, but why do I get the timeouts in
the first place? Would this be something network-related or does
Cassandra have a limit on how much streaming it can handle?

2. The server logs show errors like the one quoted below, for
malformed input around byte N --

ERROR [STREAM-IN-/10.84.30.209] 2015-07-06 11:30:10,915
StreamSession.java:499 - [Stream
#e1e4f470-23fb-11e5-9c95-9b249a189cad] Streaming error occurred
java.io.UTFDataFormatException: malformed input around byte 10
at java.io.DataInputStream.readUTF(DataInputStream.java:656) ~[na:1.7.0_67]
at java.io.DataInputStream.readUTF(DataInputStream.java:564) ~[na:1.7.0_67]
at 
org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.deserialize(FileMessageHeader.java:143)
~[apache-cassandra-2.1.6.jar:2.1.6]
at 
org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.deserialize(FileMessageHeader.java:120)
~[apache-cassandra-2.1.6.jar:2.1.6]
at 
org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:42)
~[apache-cassandra-2.1.6.jar:2.1.6]
at 
org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:38)
~[apache-cassandra-2.1.6.jar:2.1.6]
at 
org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55)
~[apache-cassandra-2.1.6.jar:2.1.6]
at 
org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:250)
~[apache-cassandra-2.1.6.jar:2.1.6]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]

Is this a familiar issue? I'd expect the data to be the same across
all streaming attempts. I can theorize about the timeouts, but any
thoughts on what might be causing them? Is this normal?

3. About compaction. There's a RESTful service in front of Cassandra
and I see the average response time is positively correlated with the
number of compactions pending (it drops as they drop). Is there a way
to stream such that the number of compactions once the streaming is
done is minimal?

4. Also about compaction: I understand that while STCS is
write-optimized and reduces the number of SSTables, LCS is
read-optimized and might increase it. The aforementioned service needs
read-only access to Cassandra. Loading with LCS resulted in an order
of magnitude more compactions and dramatically higher server load.
Given I want minimal response time ASAP, what approach should I be
taking? Right now I load with STCS, wait for compactions to finish,
and I consider a switch to LCS once it's done. Does it make sense? Any
thoughts on improving this process? (Ideally - is there anything close
to a one-shot process where compaction is barely required?)
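
If you do end up switching, a minimal sketch of the strategy change
(mykeyspace.mytable is a placeholder name):

ALTER TABLE mykeyspace.mytable
  WITH compaction = { 'class': 'LeveledCompactionStrategy' };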

I'll gladly provide additional information if needed. I'll also be
happy to hear about others' experience in similar scenarios.

Thanks,

Elad