RE: Example Data Modelling
The data model suggested isn't optimal for the "end of month" query you want to run, since you are not querying by partition key. The query would look like "select EmpID, FN, LN, basic from salaries where month = 1", which requires filtering and has unpredictable performance.

For this type of query to be fast you can use the "month" column as the partition key and "EmpID" as the clustering column. This approach also has drawbacks:

1. This data model creates a wide row. Depending on the number of employees this partition might be very large. You should limit partition sizes to 25 MB.
2. Distributing data by month means that only a small number of nodes will hold all of the salary data for a specific month, which might cause hotspots on those nodes.

Choose the approach that works best for you.

From: Carlos Alonso [mailto:i...@mrcalonso.com]
Sent: Monday, July 06, 2015 7:04 PM
To: user@cassandra.apache.org
Subject: Re: Example Data Modelling

Hi Srinivasa,

I think you're right. In Cassandra you should favor denormalisation where in an RDBMS you would find a relationship like this. I'd suggest a cf like this:

CREATE TABLE salaries (
  EmpID varchar,
  FN varchar,
  LN varchar,
  Phone varchar,
  Address varchar,
  month integer,
  basic integer,
  flexible_allowance float,
  PRIMARY KEY (EmpID, month)
)

That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want.

Hope it helps, Cheers!

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

Hi, I have a basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, EmpID, Basic, Flexible Allowance

My use case is to print the salary slip at the end of each month, and the slip contains the emp name and his other details. Now, if I want to have the same in Cassandra, I will have a single cf with emp personal details and his salary details.
Is this the right approach? Should we have the employee personal details duplicated each month? Regards, Seenu.
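To make the month-partitioned alternative from this thread concrete, it could look like the sketch below. The table name salaries_by_month and the column subset are illustrative, not from the original posts:

```cql
-- Partition by month so the end-of-month report reads a single partition.
-- Caveats from the thread apply: this is a wide row (one partition per month
-- holding every employee) and can hotspot the nodes owning that partition.
CREATE TABLE salaries_by_month (
  month integer,
  EmpID varchar,
  FN varchar,
  LN varchar,
  basic integer,
  flexible_allowance float,
  PRIMARY KEY (month, EmpID)
);

-- The end-of-month query then needs no filtering:
SELECT EmpID, FN, LN, basic FROM salaries_by_month WHERE month = 1;
```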
Re: Experiencing Timeouts on one node
3. How do we rebuild the System keyspace? Wipe this node and start it all over. hth, jason

On Tue, Jul 7, 2015 at 12:16 AM, Shashi Yachavaram shashi...@gmail.com wrote:

When we reboot the problematic node, we see the following errors in system.log.
1. Does this mean the hints column family is corrupted?
2. Can we scrub the system column family on the problematic node and its replication partners?
3. How do we rebuild the System keyspace?
==
ERROR [CompactionExecutor:950] 2015-06-27 20:11:44,595 CassandraDaemon.java (line 191) Exception in thread Thread[CompactionExecutor:950,1,main]
java.lang.AssertionError: originally calculated column size of 8684 but now it is 15725
    at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:135)
    at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
    at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
    at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
    at org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(CompactionManager.java:442)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
ERROR [HintedHandoff:552] 2015-06-27 20:11:44,595 CassandraDaemon.java (line 191) Exception in thread Thread[HintedHandoff:552,1,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 8684 but now it is 15725
    at org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:436)
    at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:282)
    at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:90)
    at org.apache.cassandra.db.HintedHandOffManager$4.run(HintedHandOffManager.java:502)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 8684 but now it is 15725
    at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
    at java.util.concurrent.FutureTask.get(Unknown Source)
    at org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:432)
    ... 6 more
Caused by: java.lang.AssertionError: originally calculated column size of 8684 but now it is 15725
    at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:135)
    at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
    at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
    at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
    at org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(CompactionManager.java:442)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
==

On Wed, Jul 1, 2015 at 11:59 AM, Shashi Yachavaram shashi...@gmail.com wrote:

We have a 28 node cluster, out of which only one node is experiencing timeouts. We thought it was the RAID, but there are two other nodes on the same RAID without any problem. Also, the problem goes away if we reboot the node, and then reappears after seven days. The following hinted hand-off timeouts are seen on the node experiencing the timeouts. Also, we did not notice any gossip errors. I was wondering if anyone has seen this issue and how they resolved it.

Cassandra Version: 1.2.15.1
OS: Linux cm 2.6.32-504.8.1.el6.x86_64 #1 SMP Fri Dec 19
Re: Example Data Modelling
Thanks for the inputs. Now my question is how the app should populate the duplicated data, i.e., if I have an employee record (along with his FN, LN, ...) for the month of Apr, and later I am populating the same record for the month of May (with the salary changed), should my application first read/fetch the corresponding data for Apr and re-insert it with the modification for May?

Regards, Seenu.

On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote: [...]
Re: Example Data Modelling
I guess you're right. Using my proposal, getting the last employee's record is straightforward and quick, but also, as Peter pointed out, getting all slips for a particular month requires you to know all the employee IDs and, ideally, run a query for each employee. Whether this works depends on how many employees you're managing.

At this moment I'm beginning to feel that maybe using both approaches is the best way to go, and I think this is one of Cassandra's recommendations: write your data in several formats if required to fit your reads. Therefore I'd use my suggestion for getting a salary by employee ID, and I'd also have Peter's for running the end-of-month query. Does it make sense?

Cheers!

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote: [...]
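If you do keep both tables, the application writes every salary record twice, and a logged batch keeps the two copies in step. A sketch under assumptions: the second table name (salaries_by_month, partitioned by month) and all the values are illustrative, and a logged batch adds coordinator overhead in exchange for atomicity:

```cql
BEGIN BATCH
  -- Copy 1: per-employee table, for "salary by employee" reads.
  INSERT INTO salaries (EmpID, FN, LN, Phone, Address, month, basic, flexible_allowance)
  VALUES ('e42', 'John', 'Doe', '555-0101', 'Elm St', 5, 50000, 1200.0);

  -- Copy 2: per-month table, for the end-of-month report.
  INSERT INTO salaries_by_month (month, EmpID, FN, LN, basic)
  VALUES (5, 'e42', 'John', 'Doe', 50000);
APPLY BATCH;
```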
Re: Is there a way to remove a node with Opscenter?
Thanks for the response. I'm trying to remove a node that's already down for some reason, so it's not allowing me to decommission it. Is there some other way to do this?

On Tue, Jul 7, 2015 at 12:45 PM, Kiran mk coolkiran2...@gmail.com wrote: [...]
Re: Question on 'Average tombstones per slice' when running cfstats
On Mon, Jul 6, 2015 at 5:38 PM, Jonathan Haddad j...@jonhaddad.com wrote: Wouldn't it suggest a delete-heavy workload, rather than update? I consider DELETE a case of UPDATE, but sure, you are correct. :D =Rob
Is there a way to remove a node with Opscenter?
I know you can use `nodetool removenode` from the command line but is there a way to remove a node from a cluster using OpsCenter?
Re: Is there a way to remove a node with Opscenter?
Yes, if your intention is to decommission a node. You can do that by clicking on the node and choosing decommission. Best Regards, Kiran.M.K.

On Jul 8, 2015 1:00 AM, Sid Tantia sid.tan...@baseboxsoftware.com wrote: [...]
Re: Is there a way to remove a node with Opscenter?
If the node is down, use: nodetool removenode <Host ID>. If the cluster does not use vnodes, you have to adjust the tokens before running the nodetool removenode command. If the node is up, the command to remove it would be "nodetool decommission". Afterwards, remove the node from the "seed list" in the cassandra.yaml configuration.

On 7 July 2015 at 12:56, Sid Tantia sid.tan...@baseboxsoftware.com wrote: [...]
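To find the Host ID that `nodetool removenode` needs, you can filter the `nodetool status` output. A sketch under an assumption: down nodes are flagged `DN` at the start of their line and the Host ID is the 36-character UUID on that line, which matches the usual `nodetool status` layout:

```shell
# Print the Host IDs of nodes reported down ("DN") by `nodetool status`.
# The grep pattern matches the 36-character UUID in the Host ID column.
extract_down_host_ids() {
  awk '/^DN/' | grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
}

# Usage against a live cluster:
#   nodetool status | extract_down_host_ids
# then, for each ID printed:
#   nodetool removenode <Host ID>
```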
Re: Is there a way to remove a node with Opscenter?
I tried both `nodetool removenode <Host ID>` and `nodetool decommission` and they both give the error: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused'.

Here is what I have tried to fix this:
1) Uncommented JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<public name>"
2) Changed rpc_address to 0.0.0.0
3) Restarted cassandra
4) Restarted datastax-agent

(Note that I installed my cluster using OpsCenter, so that may have something to do with it?)

On Tue, Jul 7, 2015 at 2:08 PM, Surbhi Gupta surbhi.gupt...@gmail.com wrote: [...]
Re: Is there a way to remove a node with Opscenter?
On Tue, Jul 7, 2015 at 4:39 PM, Sid Tantia sid.tan...@baseboxsoftware.com wrote: [...]

Instead of that stuff, why not:
1) use lsof to determine what IP is being listened to on 7199 by the running process?
2) connect to that IP?

=Rob
DTCS dropping of SSTables
Hey all, We are using DTCS and we have a TTL of 30 days for all inserts; there are no deletes/updates that we do. When an SSTable is dropped by DTCS, what kind of logging do we see in the C* logs? Any help would be useful. The reason I ask is that my DB size is not hovering around a stable size, it keeps increasing, and there has been no significant change in the traffic that creates data in C*. thanks anishek
Re: Example Data Modelling
Hi, I've been following this thread and my thoughts are in line with Carlos' latest response... Model your data to suit your queries. That is one of the data model / design considerations in Cassandra that differs from the RDBMS world. Embrace denormalization and data duplication. Disk space is cheap, so exploit how your data is laid out in order to optimize for faster reads (which are more costly than writes).

Regards, Rory Bramwell
Founder and CEO, DevOp Services
Skype: devopservices Email: rory.bramw...@devopservices.com Web: www.devopservices.com

On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote: [...]
Re: Compaction issues, 2.0.12
Thanks Rob, Jeff. I have updated the Jira issue with my information.

On 6 July 2015 at 23:46, Jeff Ferland j...@tubularlabs.com wrote: I've seen the same thing: https://issues.apache.org/jira/browse/CASSANDRA-9577. I've had cases where a restart clears the old tables, and I've had cases where a restart considers the old tables to be live.

On Jul 6, 2015, at 1:51 PM, Robert Coli rc...@eventbrite.com wrote:

On Mon, Jul 6, 2015 at 1:46 PM, Jeff Williams je...@wherethebitsroam.com wrote:

1) Cassandra version is 2.0.12.

2) Interesting. Looking at JMX org.apache.cassandra.db - ColumnFamilies - trackcontent - track_content - Attributes, I get:
LiveDiskSpaceUsed: 17788740448, i.e. ~17GB
LiveSSTableCount: 3
TotalDiskSpaceUsed: 55714084629, i.e. ~55GB
So it obviously knows about the extra disk space even though the live space looks correct. I couldn't find anything to identify the actual files though.

That's what I would expect.

3) So that was even more interesting. After restarting the cassandra daemon, the sstables were not deleted and now the same JMX attributes are:
LiveDiskSpaceUsed: 55681040579, i.e. ~55GB
LiveSSTableCount: 8
TotalDiskSpaceUsed: 55681040579, i.e. ~55GB
So some of my non-live tables are back live again, and obviously some of the big ones!!

This is permanently fatal to consistency; sorry if I was not clear enough that if they were not live, there was some risk of Cassandra considering them live again upon restart. If I were you, I would either stop the node and remove the files you know shouldn't be live, or do a major compaction ASAP.

The behavior you just encountered sounds like a bug, and it is a rather serious one. SSTables which should be dead being marked live is very bad for consistency. Do you see any exceptions in your logs or anything? If you can repro, you should file a JIRA ticket with the apache project...

=Rob
Re: Is there a way to remove a node with Opscenter?
On 07/07/2015 07:27 PM, Robert Coli wrote: [...]

OP said the node was already down/dead:

Don't forget the hierarchy of node removal in #cassandra: decommission, removenode, removenode force, assassinate. Escalate in that order. https://twitter.com/faltering/status/559845791741657088

:) Michael
Consistent reads and first write wins
Suppose I have the following schema:

CREATE TABLE foo (
  id text,
  time timeuuid,
  prop1 text,
  PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time ASC);

And I have two clients who execute quorum writes, e.g.,

// client 1
INSERT INTO foo (id, time, prop1) VALUES ('test', time_uuid_1, 'bar');
// client 2
INSERT INTO foo (id, time, prop1) VALUES ('test', time_uuid_2, 'bam');

If time_uuid_1 comes before time_uuid_2, and if both clients follow up the writes with quorum reads, then will both clients see the value 'bar' for prop1? Are there situations in which clients might see different values?

-- John
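The usual overlap argument behind "quorum write + quorum read sees the write" can be sanity-checked numerically: with replication factor N a quorum is floor(N/2) + 1, so any write quorum and any read quorum must share at least one replica. A small sketch of that arithmetic (not Cassandra code; it only covers reads started after the write completes, not reads racing an in-flight write):

```python
def quorum(replication_factor: int) -> int:
    """Size of a quorum: a strict majority of the replicas."""
    return replication_factor // 2 + 1

def guaranteed_overlap(rf: int) -> int:
    """Minimum number of replicas shared by a quorum write and a quorum read."""
    return quorum(rf) + quorum(rf) - rf

# For every replication factor, write-quorum and read-quorum replica sets
# overlap in at least one node, so a quorum read that starts after a quorum
# write has completed is guaranteed to touch a replica holding that write.
for rf in (1, 2, 3, 5, 7):
    assert guaranteed_overlap(rf) >= 1
```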
How to export query results (millions of rows) in CSV format?
Hi, Is there any way to export the results of a query (e.g. select * from tbl1 where id = aa and loc = bb) into a file in CSV format? I tried to use the COPY command with cqlsh, but the command does not work when you have a where condition?!!! Does anyone have any idea how to do this? best, /Shahab
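One common workaround is a short driver script: page through the query and stream each row to a CSV file. A sketch under assumptions — the keyspace, table, and column names are placeholders, and the cassandra-driver calls are shown as comments so the CSV-writing part stands on its own:

```python
import csv

def rows_to_csv(rows, path, header):
    """Stream an iterable of row tuples into a CSV file, one row at a time."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)

# With the DataStax Python driver, `rows` would come from a paged query, e.g.:
#   from cassandra.cluster import Cluster
#   from cassandra.query import SimpleStatement
#   session = Cluster(['127.0.0.1']).connect('mykeyspace')
#   stmt = SimpleStatement(
#       "SELECT * FROM tbl1 WHERE id = 'aa' AND loc = 'bb'", fetch_size=5000)
#   rows_to_csv(session.execute(stmt), 'out.csv', header=('id', 'loc', 'value'))
# The driver fetches pages lazily, so millions of rows never sit in memory at once.
```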
Re: Example Data Modelling
Hi Jérôme, Good point!! Really a nice usage of static columns! BTW, wouldn't the EmpID be static as well?

Cheers

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 7 July 2015 at 14:42, Jérôme Mainaud jer...@mainaud.com wrote:

Hello, You can slightly adapt Carlos' answer to reduce replication of data that doesn't change from month to month. Static columns are great for this. The table becomes:

CREATE TABLE salaries (
  EmpID varchar,
  FN varchar static,
  LN varchar static,
  Phone varchar static,
  Address varchar static,
  month integer,
  basic integer,
  flexible_allowance float,
  PRIMARY KEY (EmpID, month)
)

There is only one copy of a static column per partition; the value is shared between all rows of the partition. When the employee data changes you can update it with just the partition key in the where clause. When you insert a new month's entry you just fill the non-static columns. The table can be queried the same way as the original one.

Cheers
--
Jérôme Mainaud
jer...@mainaud.com

2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services rory.bramw...@devopservices.com: [...]
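With the static-column table, the two write paths Jérôme describes might look like this (the EmpID and the values are illustrative):

```cql
-- Employee details change: update the static columns once per partition,
-- using only the partition key in the WHERE clause.
UPDATE salaries SET Phone = '555-0199', Address = 'Oak St' WHERE EmpID = 'e42';

-- New month: insert only the per-row (non-static) salary columns.
INSERT INTO salaries (EmpID, month, basic, flexible_allowance)
VALUES ('e42', 5, 52000, 1300.0);
```

Every row of the partition then sees the updated Phone and Address, since the static values are stored once per partition rather than once per month.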
Re: Example Data Modelling
25 MB seems very specific. Is there a reason why?

On Tuesday, July 7, 2015, Peer, Oded <oded.p...@rsa.com> wrote:

> The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key. [...]
> On 6 July 2015 at 13:01, Srinivasa T N <seen...@gmail.com> wrote:
> > [...] Is this the right approach? Should we have the employee personal details duplicated each month?
> > Regards,
> > Seenu.

--
John
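[For readers wondering where a figure like 25 MB bites in practice: the ceiling is a rule of thumb rather than a hard limit, but a quick back-of-the-envelope estimate shows how the month-keyed design scales with headcount. The per-row byte count below is an assumption for illustration, not a measured value:]

```python
# Back-of-envelope size of one month's partition when "month" is the
# partition key: every employee contributes one row to the same partition.
# AVG_ROW_BYTES is an assumed figure, not measured on a real cluster.

AVG_ROW_BYTES = 120  # EmpID, FN, LN, basic, allowance + per-cell overhead (assumed)

def partition_bytes(employees: int, row_bytes: int = AVG_ROW_BYTES) -> int:
    """Approximate on-disk size of one month's partition."""
    return employees * row_bytes

def within_budget(employees: int, budget_mb: float = 25.0) -> bool:
    """Does the month partition stay under the chosen size budget?"""
    return partition_bytes(employees) <= budget_mb * 1024 * 1024

# With these assumptions, ~100k employees fits a 25 MB budget comfortably,
# while ~300k blows past it and becomes a wide-row concern.
```

[Measuring real partition sizes, e.g. via nodetool cfstats, is the reliable way to check rather than trusting assumed row sizes.]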
Re: Example Data Modelling
Hello,

You can slightly adapt Carlos's answer to reduce replication of data that doesn't change from month to month. Static columns are great for this. The table becomes:

CREATE TABLE salaries (
  EmpID varchar,
  FN varchar static,
  LN varchar static,
  Phone varchar static,
  Address varchar static,
  month int,
  basic int,
  flexible_allowance float,
  PRIMARY KEY (EmpID, month)
);

There is only one copy of each static column per partition; the value is shared between all rows of the partition. When employee data changes you can update it with just the partition key in the WHERE clause. When you insert a new month's entry you only fill the non-static columns. The table can be queried the same way as the original one.

Cheers
--
Jérôme Mainaud
jer...@mainaud.com

2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services <rory.bramw...@devopservices.com>:

Hi,

I've been following this thread and my thoughts are in line with Carlos's latest response. Model your data to suit your queries: that is one of the data model / design considerations in Cassandra that differs from the RDBMS world. Embrace denormalisation and data duplication. Disk space is cheap, so exploit how your data is laid out in order to optimize for faster reads (which are more costly than writes).

Regards,

Rory Bramwell
Founder and CEO, DevOp Services
Skype: devopservices
Email: rory.bramw...@devopservices.com
Web: www.devopservices.com

On Jul 7, 2015 4:02 AM, Carlos Alonso <i...@mrcalonso.com> wrote:

I guess you're right: using my proposal, getting the latest record for an employee is straightforward and quick, but also, as Peer pointed out, getting all slips for a particular month requires you to know all the employee IDs and, ideally, run a query per employee. Whether that works depends on how many employees you're managing. At this moment I'm beginning to feel that maybe using both approaches is the best way to go. And I think this is one of Cassandra's recommendations: write your data in several formats if required to fit your reads.
Therefore I'd use my suggestion for getting a salary by employee ID, and I'd also have Peer's table to run the end-of-month query. Does it make sense?

Cheers!

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 7 July 2015 at 09:07, Srinivasa T N <seen...@gmail.com> wrote:

Thanks for the inputs. Now my question is how the app should populate the duplicate data. I.e., if I have an employee record (along with his FN, LN, ...) for the month of Apr and later I am populating the same record for the month of May (with the salary changed), should my application first read/fetch the corresponding data for Apr and re-insert it, with modifications, for the month of May?

Regards,
Seenu.

On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded <oded.p...@rsa.com> wrote:

> The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key. [...]
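[Carlos's "use both models" suggestion, combined with Jérôme's static columns, also answers Seenu's read-before-write question: the application never has to fetch April's row to write May's. Below is a minimal in-memory sketch of that write path, with plain Python dicts standing in for the two CQL tables; all helper names are illustrative, not from the thread.]

```python
# Two read-optimized views of the same data, written together on each payroll
# run. salaries_by_emp mimics the (EmpID, month) table where personal details
# are stored once per partition (static columns); salaries_by_month mimics the
# (month, EmpID) table used for the end-of-month report.

salaries_by_emp = {}    # EmpID -> {"static": {...}, "months": {month: row}}
salaries_by_month = {}  # month -> {EmpID: denormalized record}

def upsert_employee(emp_id, fn, ln):
    # CQL analogue: UPDATE salaries SET FN = ?, LN = ? WHERE EmpID = ?
    p = salaries_by_emp.setdefault(emp_id, {"static": {}, "months": {}})
    p["static"].update({"FN": fn, "LN": ln})

def record_salary(emp_id, month, basic, allowance):
    p = salaries_by_emp[emp_id]
    p["months"][month] = {"basic": basic, "flexible_allowance": allowance}
    # Denormalized copy for the monthly report; no read of prior months needed.
    salaries_by_month.setdefault(month, {})[emp_id] = {
        **p["static"], "basic": basic, "flexible_allowance": allowance,
    }

upsert_employee("e1", "Srinivasa", "T N")
record_salary("e1", 4, 1000, 200.0)
record_salary("e1", 5, 1100, 200.0)   # May: only the changed salary is written

may_report = salaries_by_month[5]      # all slips for May, one partition's worth
```

[In real CQL the two writes per payroll run would go in a logged batch so the views cannot diverge; the cost is the extra write amplification Peer and Carlos discuss above.]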
[ANNOUNCE] YCSB 0.2.0 Release
On behalf of the development community, I am pleased to announce the release of YCSB version 0.2.0.

Highlights:

* Apache Cassandra 2.0 CQL support
* Apache HBase 1.0 support
* Apache Accumulo 1.6 support
* MongoDB - support for all production versions released since 2011
* Tarantool 1.6 support
* ~5 additional datastore bindings in experimental status
* Optional support for latency collection via HdrHistogram
* Optional support for fixing coordinated omission

Full release notes, including links to source and convenience binaries:
https://github.com/brianfrankcooper/YCSB/releases/tag/0.2.0

This release is the first from the project in 3.5 years, so I'd recommend reading the release notes if you're a user. The project is moving to a monthly release cadence, so hopefully future releases will be easier to incrementally consume.

--
Sean
Streaming data to Cassandra with Hadoop
Hello,

I'm loading data from HDFS into Cassandra using Spotify's hdfs2cass. The setup is a 4-node cluster running Cassandra 2.1.6, RF=2, STCS; raw data size is about 1 TB before loading and 3.8 TB after. The process works fine, but I do have a few questions.

1. Some Hadoop jobs fail due to streaming timeouts. That's fine, because subsequent attempts succeed, but why do I get the timeouts in the first place? Would this be something network-related, or does Cassandra have a limit on how much streaming it can handle?

2. The server logs show errors like the one quoted below, for malformed input around byte N:

ERROR [STREAM-IN-/10.84.30.209] 2015-07-06 11:30:10,915 StreamSession.java:499 - [Stream #e1e4f470-23fb-11e5-9c95-9b249a189cad] Streaming error occurred
java.io.UTFDataFormatException: malformed input around byte 10
    at java.io.DataInputStream.readUTF(DataInputStream.java:656) ~[na:1.7.0_67]
    at java.io.DataInputStream.readUTF(DataInputStream.java:564) ~[na:1.7.0_67]
    at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.deserialize(FileMessageHeader.java:143) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.deserialize(FileMessageHeader.java:120) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:42) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:38) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:250) ~[apache-cassandra-2.1.6.jar:2.1.6]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]

Is this a familiar issue?
I'd expect the data to be the same across all streaming attempts. The timeouts I can theorize about, but any thoughts on what might be causing these? Is it normal?

3. About compaction: there's a RESTful service in front of Cassandra, and I see the average response time is positively correlated with the number of compactions pending (it drops as they drop). Is there a way to stream such that the number of pending compactions once the streaming is done is minimal?

4. Also about compaction: I understand that while STCS is write-optimized and reduces the number of SSTables, LCS is read-optimized and might increase it. The aforementioned service needs read-only access to Cassandra. Loading with LCS resulted in an order of magnitude more compactions and dramatically higher server load. Given that I want minimal response time as soon as possible, what approach should I be taking? Right now I load with STCS, wait for compactions to finish, and I'm considering a switch to LCS once that's done. Does that make sense? Any thoughts on improving this process? (Ideally: is there anything close to a one-shot process where compaction is barely required?)

I'll gladly provide additional information if needed. I'll also be happy to hear about others' experience in similar scenarios.

Thanks,
Elad
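[On question 3/4, an intuition for why bulk loading leaves a compaction backlog under STCS: streaming drops many similar-sized SSTables at once, and every size bucket that reaches the compaction threshold immediately becomes a pending task. The sketch below is a deliberately simplified toy model, not Cassandra's actual algorithm (the real STCS buckets around average sizes with bucket_low/bucket_high factors):]

```python
# Toy size-tiered grouping: SSTables of similar size fall into the same bucket;
# any bucket holding at least `threshold` tables is a compaction candidate
# (Cassandra's default min_threshold is 4).

def bucket_by_size(sizes_mb, ratio=1.5):
    """Greedily group sorted sizes; 'similar' = within ratio of the bucket's smallest."""
    buckets = []
    for size in sorted(sizes_mb):
        for b in buckets:
            if size <= b[0] * ratio:
                b.append(size)
                break
        else:
            buckets.append([size])
    return buckets

def pending_candidates(sizes_mb, threshold=4):
    """Buckets eligible for compaction right now."""
    return [b for b in bucket_by_size(sizes_mb) if len(b) >= threshold]

# A bulk load that streams many ~100 MB SSTables creates an instant backlog,
# while one big pre-compacted table would create none:
streamed = [100, 102, 98, 101, 99, 103, 2000]
```

[In this toy model the six ~100 MB tables form one eligible bucket the moment streaming finishes, which matches the observed burst of pending compactions; the load-with-STCS, wait, then ALTER to LCS plan follows naturally from this.]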