Re: Example Data Modelling
Hi Carlos, Columns in primary key like EmpID can't be static. But remind that EmpID as in the partition key is not duplicated. -- Jérôme Mainaud jer...@mainaud.com 2015-07-07 16:02 GMT+02:00 Carlos Alonso i...@mrcalonso.com: Hi Jerome, Good point!! Really a nice usage of static columns! BTW, wouldn't the EmpID be static as well? Cheers Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 7 July 2015 at 14:42, Jérôme Mainaud jer...@mainaud.com wrote: Hello, You can slightly adapt Carlos answer to reduce repliation of data that don't change for month to month. Static columns are great for this. The table become: CREATE TABLE salaries ( EmpID varchar, FN varchar *static*, LN varchar *static*, Phone varchar *static*, Address varchar *static*, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) There is only one copy of static column per partition the value is shared between all rows of the partition. When Employee data change you can update it with the partition key in the where clause. When you insert a new month entry you just fill non static columns. The table can be queried the same way as the original one. Cheers -- Jérôme Mainaud jer...@mainaud.com 2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services rory.bramw...@devopservices.com: Hi, I've been following this thread and my thoughts are inline with Carlos' latest response... Model your data to suite your queries. That is one of the data model / design considerations in Cassandra that differs from the RDBMS world. Embrace demoralization and data duplication. Disk space is cheapest, so exploit how your data is laid out in order to optimize for faster reads (which are more costly than writes). Regards, Rory Bramwell Founder and CEO DevOp Services Skype: devopservices Email: rory.bramw...@devopservices.com Web: www.devopservices.com On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote: I guess you're right, using my proposal, getting last employee's record is straightforward and quick, but also, as Peter pointed, getting all slips for a particular month requires you to know all the employee IDs and, ideally, run a query for each employee. This would work depending on how many employees you're managing. At this moment I'm beginning to feel that maybe using both approaches is the best way to go. And I think this is one of Cassandra's recommendations: Write your data in several formats if required to fit your reads. Therefore I'd use my suggestion for getting a salary by employee ID and I'd also have Peter's one to run the end of the month query. Does it make sense? Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote: Thanks for the inputs. Now my question is how should the app populate the duplicate data, i.e., if I have an employee record (along with his FN, LN,..) for the month of Apr and later I am populating the same record for the month of may (with salary changed), should my application first read/fetch the corresponding data for apr and re-insert with modification for month of may? Regards, Seenu. On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote: The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key. The query would look like “select EmpID, FN, LN, basic from salaries where month = 1” which requires filtering and has unpredictable performance. For this type of query to be fast you can use the “month” column as the partition key and the “EmpID” and the clustering column. This approach also has drawbacks: 1. This data model creates a wide row. Depending on the number of employees this partition might be very large. You should limit partition sizes to 25MB 2. Distributing data according to month means that only a small number of nodes will hold all of the salary data for a specific month which might cause hotspots on those nodes. Choose the approach that works best for you. *From:* Carlos Alonso [mailto:i...@mrcalonso.com] *Sent:* Monday, July 06, 2015 7:04 PM *To:* user@cassandra.apache.org *Subject:* Re: Example Data Modelling Hi Srinivasa, I think you're right, In Cassandra you should favor denormalisation when in RDBMS you find a relationship like this. I'd suggest a cf like this CREATE TABLE salaries ( EmpID varchar, FN varchar, LN varchar, Phone varchar, Address varchar, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want. Hope it helps, Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com
Re: Example Data Modelling
If going by Month as partition key then you need to duplicate the data. I dont think going with name as partition key is good datamodel practice as it will create a hotspot. Also I believe your queries will be mostly by employee not by month. You can create employee id as partition key and month as clustering and keep employee details as static columns so they wont be repeated Naidu Saladi From: Srinivasa T N seen...@gmail.com To: user@cassandra.apache.org user@cassandra.apache.org Sent: Tuesday, July 7, 2015 3:07 AM Subject: Re: Example Data Modelling Thanks for the inputs. Now my question is how should the app populate the duplicate data, i.e., if I have an employee record (along with his FN, LN,..) for the month of Apr and later I am populating the same record for the month of may (with salary changed), should my application first read/fetch the corresponding data for apr and re-insert with modification for month of may? Regards, Seenu. On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote: The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key.The query would look like “select EmpID, FN, LN, basic from salaries where month = 1” which requires filtering and has unpredictable performance. For this type of query to be fast you can use the “month” column as the partition key and the “EmpID” and the clustering column.This approach also has drawbacks:1. This data model creates a wide row. Depending on the number of employees this partition might be very large. You should limit partition sizes to 25MB2. Distributing data according to month means that only a small number of nodes will hold all of the salary data for a specific month which might cause hotspots on those nodes. Choose the approach that works best for you. From: Carlos Alonso [mailto:i...@mrcalonso.com] Sent: Monday, July 06, 2015 7:04 PM To: user@cassandra.apache.org Subject: Re: Example Data Modelling Hi Srinivasa, I think you're right, In Cassandra you should favor denormalisation when in RDBMS you find a relationship like this. I'd suggest a cf like thisCREATE TABLE salaries ( EmpID varchar, FN varchar, LN varchar, Phone varchar, Address varchar, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month)) That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want. Hope it helps,Cheers! Carlos Alonso | Software Engineer | @calonso On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:Hi, I have basic doubt: I have an RDBMS with the following two tables: Emp - EmpID, FN, LN, Phone, Address Sal - Month, Empid, Basic, Flexible Allowance My use case is to print the Salary slip at the end of each month and the slip contains emp name and his other details. Now, if I want to have the same in cassandra, I will have a single cf with emp personal details and his salary details. Is this the right approach? Should we have the employee personal details duplicated each month? Regards, Seenu.
RE: Example Data Modelling
The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key. The query would look like “select EmpID, FN, LN, basic from salaries where month = 1” which requires filtering and has unpredictable performance. For this type of query to be fast you can use the “month” column as the partition key and the “EmpID” and the clustering column. This approach also has drawbacks: 1. This data model creates a wide row. Depending on the number of employees this partition might be very large. You should limit partition sizes to 25MB 2. Distributing data according to month means that only a small number of nodes will hold all of the salary data for a specific month which might cause hotspots on those nodes. Choose the approach that works best for you. From: Carlos Alonso [mailto:i...@mrcalonso.com] Sent: Monday, July 06, 2015 7:04 PM To: user@cassandra.apache.org Subject: Re: Example Data Modelling Hi Srinivasa, I think you're right, In Cassandra you should favor denormalisation when in RDBMS you find a relationship like this. I'd suggest a cf like this CREATE TABLE salaries ( EmpID varchar, FN varchar, LN varchar, Phone varchar, Address varchar, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want. Hope it helps, Cheers! Carlos Alonso | Software Engineer | @calonsohttps://twitter.com/calonso On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.commailto:seen...@gmail.com wrote: Hi, I have basic doubt: I have an RDBMS with the following two tables: Emp - EmpID, FN, LN, Phone, Address Sal - Month, Empid, Basic, Flexible Allowance My use case is to print the Salary slip at the end of each month and the slip contains emp name and his other details. Now, if I want to have the same in cassandra, I will have a single cf with emp personal details and his salary details. Is this the right approach? Should we have the employee personal details duplicated each month? Regards, Seenu.
Re: Example Data Modelling
Thanks for the inputs. Now my question is how should the app populate the duplicate data, i.e., if I have an employee record (along with his FN, LN,..) for the month of Apr and later I am populating the same record for the month of may (with salary changed), should my application first read/fetch the corresponding data for apr and re-insert with modification for month of may? Regards, Seenu. On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote: The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key. The query would look like “select EmpID, FN, LN, basic from salaries where month = 1” which requires filtering and has unpredictable performance. For this type of query to be fast you can use the “month” column as the partition key and the “EmpID” and the clustering column. This approach also has drawbacks: 1. This data model creates a wide row. Depending on the number of employees this partition might be very large. You should limit partition sizes to 25MB 2. Distributing data according to month means that only a small number of nodes will hold all of the salary data for a specific month which might cause hotspots on those nodes. Choose the approach that works best for you. *From:* Carlos Alonso [mailto:i...@mrcalonso.com] *Sent:* Monday, July 06, 2015 7:04 PM *To:* user@cassandra.apache.org *Subject:* Re: Example Data Modelling Hi Srinivasa, I think you're right, In Cassandra you should favor denormalisation when in RDBMS you find a relationship like this. I'd suggest a cf like this CREATE TABLE salaries ( EmpID varchar, FN varchar, LN varchar, Phone varchar, Address varchar, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want. Hope it helps, Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote: Hi, I have basic doubt: I have an RDBMS with the following two tables: Emp - EmpID, FN, LN, Phone, Address Sal - Month, Empid, Basic, Flexible Allowance My use case is to print the Salary slip at the end of each month and the slip contains emp name and his other details. Now, if I want to have the same in cassandra, I will have a single cf with emp personal details and his salary details. Is this the right approach? Should we have the employee personal details duplicated each month? Regards, Seenu.
Re: Example Data Modelling
I guess you're right, using my proposal, getting last employee's record is straightforward and quick, but also, as Peter pointed, getting all slips for a particular month requires you to know all the employee IDs and, ideally, run a query for each employee. This would work depending on how many employees you're managing. At this moment I'm beginning to feel that maybe using both approaches is the best way to go. And I think this is one of Cassandra's recommendations: Write your data in several formats if required to fit your reads. Therefore I'd use my suggestion for getting a salary by employee ID and I'd also have Peter's one to run the end of the month query. Does it make sense? Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote: Thanks for the inputs. Now my question is how should the app populate the duplicate data, i.e., if I have an employee record (along with his FN, LN,..) for the month of Apr and later I am populating the same record for the month of may (with salary changed), should my application first read/fetch the corresponding data for apr and re-insert with modification for month of may? Regards, Seenu. On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote: The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key. The query would look like “select EmpID, FN, LN, basic from salaries where month = 1” which requires filtering and has unpredictable performance. For this type of query to be fast you can use the “month” column as the partition key and the “EmpID” and the clustering column. This approach also has drawbacks: 1. This data model creates a wide row. Depending on the number of employees this partition might be very large. You should limit partition sizes to 25MB 2. Distributing data according to month means that only a small number of nodes will hold all of the salary data for a specific month which might cause hotspots on those nodes. Choose the approach that works best for you. *From:* Carlos Alonso [mailto:i...@mrcalonso.com] *Sent:* Monday, July 06, 2015 7:04 PM *To:* user@cassandra.apache.org *Subject:* Re: Example Data Modelling Hi Srinivasa, I think you're right, In Cassandra you should favor denormalisation when in RDBMS you find a relationship like this. I'd suggest a cf like this CREATE TABLE salaries ( EmpID varchar, FN varchar, LN varchar, Phone varchar, Address varchar, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want. Hope it helps, Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote: Hi, I have basic doubt: I have an RDBMS with the following two tables: Emp - EmpID, FN, LN, Phone, Address Sal - Month, Empid, Basic, Flexible Allowance My use case is to print the Salary slip at the end of each month and the slip contains emp name and his other details. Now, if I want to have the same in cassandra, I will have a single cf with emp personal details and his salary details. Is this the right approach? Should we have the employee personal details duplicated each month? Regards, Seenu.
Re: Example Data Modelling
Hi, I've been following this thread and my thoughts are inline with Carlos' latest response... Model your data to suite your queries. That is one of the data model / design considerations in Cassandra that differs from the RDBMS world. Embrace demoralization and data duplication. Disk space is cheapest, so exploit how your data is laid out in order to optimize for faster reads (which are more costly than writes). Regards, Rory Bramwell Founder and CEO DevOp Services Skype: devopservices Email: rory.bramw...@devopservices.com Web: www.devopservices.com On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote: I guess you're right, using my proposal, getting last employee's record is straightforward and quick, but also, as Peter pointed, getting all slips for a particular month requires you to know all the employee IDs and, ideally, run a query for each employee. This would work depending on how many employees you're managing. At this moment I'm beginning to feel that maybe using both approaches is the best way to go. And I think this is one of Cassandra's recommendations: Write your data in several formats if required to fit your reads. Therefore I'd use my suggestion for getting a salary by employee ID and I'd also have Peter's one to run the end of the month query. Does it make sense? Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote: Thanks for the inputs. Now my question is how should the app populate the duplicate data, i.e., if I have an employee record (along with his FN, LN,..) for the month of Apr and later I am populating the same record for the month of may (with salary changed), should my application first read/fetch the corresponding data for apr and re-insert with modification for month of may? Regards, Seenu. On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote: The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key. The query would look like “select EmpID, FN, LN, basic from salaries where month = 1” which requires filtering and has unpredictable performance. For this type of query to be fast you can use the “month” column as the partition key and the “EmpID” and the clustering column. This approach also has drawbacks: 1. This data model creates a wide row. Depending on the number of employees this partition might be very large. You should limit partition sizes to 25MB 2. Distributing data according to month means that only a small number of nodes will hold all of the salary data for a specific month which might cause hotspots on those nodes. Choose the approach that works best for you. *From:* Carlos Alonso [mailto:i...@mrcalonso.com] *Sent:* Monday, July 06, 2015 7:04 PM *To:* user@cassandra.apache.org *Subject:* Re: Example Data Modelling Hi Srinivasa, I think you're right, In Cassandra you should favor denormalisation when in RDBMS you find a relationship like this. I'd suggest a cf like this CREATE TABLE salaries ( EmpID varchar, FN varchar, LN varchar, Phone varchar, Address varchar, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want. Hope it helps, Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote: Hi, I have basic doubt: I have an RDBMS with the following two tables: Emp - EmpID, FN, LN, Phone, Address Sal - Month, Empid, Basic, Flexible Allowance My use case is to print the Salary slip at the end of each month and the slip contains emp name and his other details. Now, if I want to have the same in cassandra, I will have a single cf with emp personal details and his salary details. Is this the right approach? Should we have the employee personal details duplicated each month? Regards, Seenu.
Re: Example Data Modelling
Hi Jerome, Good point!! Really a nice usage of static columns! BTW, wouldn't the EmpID be static as well? Cheers Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 7 July 2015 at 14:42, Jérôme Mainaud jer...@mainaud.com wrote: Hello, You can slightly adapt Carlos answer to reduce repliation of data that don't change for month to month. Static columns are great for this. The table become: CREATE TABLE salaries ( EmpID varchar, FN varchar *static*, LN varchar *static*, Phone varchar *static*, Address varchar *static*, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) There is only one copy of static column per partition the value is shared between all rows of the partition. When Employee data change you can update it with the partition key in the where clause. When you insert a new month entry you just fill non static columns. The table can be queried the same way as the original one. Cheers -- Jérôme Mainaud jer...@mainaud.com 2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services rory.bramw...@devopservices.com: Hi, I've been following this thread and my thoughts are inline with Carlos' latest response... Model your data to suite your queries. That is one of the data model / design considerations in Cassandra that differs from the RDBMS world. Embrace demoralization and data duplication. Disk space is cheapest, so exploit how your data is laid out in order to optimize for faster reads (which are more costly than writes). Regards, Rory Bramwell Founder and CEO DevOp Services Skype: devopservices Email: rory.bramw...@devopservices.com Web: www.devopservices.com On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote: I guess you're right, using my proposal, getting last employee's record is straightforward and quick, but also, as Peter pointed, getting all slips for a particular month requires you to know all the employee IDs and, ideally, run a query for each employee. This would work depending on how many employees you're managing. At this moment I'm beginning to feel that maybe using both approaches is the best way to go. And I think this is one of Cassandra's recommendations: Write your data in several formats if required to fit your reads. Therefore I'd use my suggestion for getting a salary by employee ID and I'd also have Peter's one to run the end of the month query. Does it make sense? Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote: Thanks for the inputs. Now my question is how should the app populate the duplicate data, i.e., if I have an employee record (along with his FN, LN,..) for the month of Apr and later I am populating the same record for the month of may (with salary changed), should my application first read/fetch the corresponding data for apr and re-insert with modification for month of may? Regards, Seenu. On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote: The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key. The query would look like “select EmpID, FN, LN, basic from salaries where month = 1” which requires filtering and has unpredictable performance. For this type of query to be fast you can use the “month” column as the partition key and the “EmpID” and the clustering column. This approach also has drawbacks: 1. This data model creates a wide row. Depending on the number of employees this partition might be very large. You should limit partition sizes to 25MB 2. Distributing data according to month means that only a small number of nodes will hold all of the salary data for a specific month which might cause hotspots on those nodes. Choose the approach that works best for you. *From:* Carlos Alonso [mailto:i...@mrcalonso.com] *Sent:* Monday, July 06, 2015 7:04 PM *To:* user@cassandra.apache.org *Subject:* Re: Example Data Modelling Hi Srinivasa, I think you're right, In Cassandra you should favor denormalisation when in RDBMS you find a relationship like this. I'd suggest a cf like this CREATE TABLE salaries ( EmpID varchar, FN varchar, LN varchar, Phone varchar, Address varchar, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want. Hope it helps, Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote: Hi, I have basic doubt: I have an RDBMS with the following two tables: Emp - EmpID, FN, LN, Phone, Address Sal - Month, Empid, Basic, Flexible Allowance
Re: Example Data Modelling
25 MB seems very specific. Is there a reason why? On Tuesday, July 7, 2015, Peer, Oded oded.p...@rsa.com wrote: The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key. The query would look like “select EmpID, FN, LN, basic from salaries where month = 1” which requires filtering and has unpredictable performance. For this type of query to be fast you can use the “month” column as the partition key and the “EmpID” and the clustering column. This approach also has drawbacks: 1. This data model creates a wide row. Depending on the number of employees this partition might be very large. You should limit partition sizes to 25MB 2. Distributing data according to month means that only a small number of nodes will hold all of the salary data for a specific month which might cause hotspots on those nodes. Choose the approach that works best for you. *From:* Carlos Alonso [mailto:i...@mrcalonso.com javascript:_e(%7B%7D,'cvml','i...@mrcalonso.com');] *Sent:* Monday, July 06, 2015 7:04 PM *To:* user@cassandra.apache.org javascript:_e(%7B%7D,'cvml','user@cassandra.apache.org'); *Subject:* Re: Example Data Modelling Hi Srinivasa, I think you're right, In Cassandra you should favor denormalisation when in RDBMS you find a relationship like this. I'd suggest a cf like this CREATE TABLE salaries ( EmpID varchar, FN varchar, LN varchar, Phone varchar, Address varchar, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want. Hope it helps, Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com javascript:_e(%7B%7D,'cvml','seen...@gmail.com'); wrote: Hi, I have basic doubt: I have an RDBMS with the following two tables: Emp - EmpID, FN, LN, Phone, Address Sal - Month, Empid, Basic, Flexible Allowance My use case is to print the Salary slip at the end of each month and the slip contains emp name and his other details. Now, if I want to have the same in cassandra, I will have a single cf with emp personal details and his salary details. Is this the right approach? Should we have the employee personal details duplicated each month? Regards, Seenu. -- - John
Re: Example Data Modelling
Hello, You can slightly adapt Carlos answer to reduce repliation of data that don't change for month to month. Static columns are great for this. The table become: CREATE TABLE salaries ( EmpID varchar, FN varchar *static*, LN varchar *static*, Phone varchar *static*, Address varchar *static*, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) There is only one copy of static column per partition the value is shared between all rows of the partition. When Employee data change you can update it with the partition key in the where clause. When you insert a new month entry you just fill non static columns. The table can be queried the same way as the original one. Cheers -- Jérôme Mainaud jer...@mainaud.com 2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services rory.bramw...@devopservices.com: Hi, I've been following this thread and my thoughts are inline with Carlos' latest response... Model your data to suite your queries. That is one of the data model / design considerations in Cassandra that differs from the RDBMS world. Embrace demoralization and data duplication. Disk space is cheapest, so exploit how your data is laid out in order to optimize for faster reads (which are more costly than writes). Regards, Rory Bramwell Founder and CEO DevOp Services Skype: devopservices Email: rory.bramw...@devopservices.com Web: www.devopservices.com On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote: I guess you're right, using my proposal, getting last employee's record is straightforward and quick, but also, as Peter pointed, getting all slips for a particular month requires you to know all the employee IDs and, ideally, run a query for each employee. This would work depending on how many employees you're managing. At this moment I'm beginning to feel that maybe using both approaches is the best way to go. And I think this is one of Cassandra's recommendations: Write your data in several formats if required to fit your reads. Therefore I'd use my suggestion for getting a salary by employee ID and I'd also have Peter's one to run the end of the month query. Does it make sense? Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote: Thanks for the inputs. Now my question is how should the app populate the duplicate data, i.e., if I have an employee record (along with his FN, LN,..) for the month of Apr and later I am populating the same record for the month of may (with salary changed), should my application first read/fetch the corresponding data for apr and re-insert with modification for month of may? Regards, Seenu. On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote: The data model suggested isn’t optimal for the “end of month” query you want to run since you are not querying by partition key. The query would look like “select EmpID, FN, LN, basic from salaries where month = 1” which requires filtering and has unpredictable performance. For this type of query to be fast you can use the “month” column as the partition key and the “EmpID” and the clustering column. This approach also has drawbacks: 1. This data model creates a wide row. Depending on the number of employees this partition might be very large. You should limit partition sizes to 25MB 2. Distributing data according to month means that only a small number of nodes will hold all of the salary data for a specific month which might cause hotspots on those nodes. Choose the approach that works best for you. *From:* Carlos Alonso [mailto:i...@mrcalonso.com] *Sent:* Monday, July 06, 2015 7:04 PM *To:* user@cassandra.apache.org *Subject:* Re: Example Data Modelling Hi Srinivasa, I think you're right, In Cassandra you should favor denormalisation when in RDBMS you find a relationship like this. I'd suggest a cf like this CREATE TABLE salaries ( EmpID varchar, FN varchar, LN varchar, Phone varchar, Address varchar, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want. Hope it helps, Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote: Hi, I have basic doubt: I have an RDBMS with the following two tables: Emp - EmpID, FN, LN, Phone, Address Sal - Month, Empid, Basic, Flexible Allowance My use case is to print the Salary slip at the end of each month and the slip contains emp name and his other details. Now, if I want to have the same in cassandra, I will have a single cf with emp personal details and his salary details. Is this the right approach
Example Data Modelling
Hi, I have basic doubt: I have an RDBMS with the following two tables: Emp - EmpID, FN, LN, Phone, Address Sal - Month, Empid, Basic, Flexible Allowance My use case is to print the Salary slip at the end of each month and the slip contains emp name and his other details. Now, if I want to have the same in cassandra, I will have a single cf with emp personal details and his salary details. Is this the right approach? Should we have the employee personal details duplicated each month? Regards, Seenu.
Re: Example Data Modelling
Hi Srinivasa, I think you're right, In Cassandra you should favor denormalisation when in RDBMS you find a relationship like this. I'd suggest a cf like this CREATE TABLE salaries ( EmpID varchar, FN varchar, LN varchar, Phone varchar, Address varchar, month integer, basic integer, flexible_allowance float, PRIMARY KEY(EmpID, month) ) That way the salaries will be partitioned by EmpID and clustered by month, which I guess is the natural sorting you want. Hope it helps, Cheers! Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote: Hi, I have basic doubt: I have an RDBMS with the following two tables: Emp - EmpID, FN, LN, Phone, Address Sal - Month, Empid, Basic, Flexible Allowance My use case is to print the Salary slip at the end of each month and the slip contains emp name and his other details. Now, if I want to have the same in cassandra, I will have a single cf with emp personal details and his salary details. Is this the right approach? Should we have the employee personal details duplicated each month? Regards, Seenu.