Re: Example Data Modelling

2015-07-11 Thread Jérôme Mainaud
Hi Carlos,

Columns in primary key like EmpID can't be static.
But remind that EmpID as in the partition key is not duplicated.

-- 
Jérôme Mainaud
jer...@mainaud.com

2015-07-07 16:02 GMT+02:00 Carlos Alonso i...@mrcalonso.com:

 Hi Jerome,

 Good point!! Really a nice usage of static columns! BTW, wouldn't the
 EmpID be static as well?

 Cheers

 Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

 On 7 July 2015 at 14:42, Jérôme Mainaud jer...@mainaud.com wrote:

 Hello,

 You can slightly adapt Carlos answer to reduce repliation of data that
 don't change for month to month.
 Static columns are great for this.

 The table become:

 CREATE TABLE salaries (
   EmpID varchar,
   FN varchar *static*,
   LN varchar *static*,
   Phone varchar *static*,
   Address varchar *static*,
   month integer,
   basic integer,
   flexible_allowance float,
   PRIMARY KEY(EmpID, month)
 )

 There is only one copy of static column per partition the value is shared
 between all rows of the partition.
 When Employee data change you can update it with the partition key in the
 where clause.
 When you insert a new month entry you just fill non static columns.
 The table can be queried the same way as the original one.

 Cheers



 --
 Jérôme Mainaud
 jer...@mainaud.com

 2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services 
 rory.bramw...@devopservices.com:

 Hi,

 I've been following this thread and my thoughts are inline with Carlos'
 latest response... Model your data to suite your queries. That is one of
 the data model / design considerations in Cassandra that differs from the
 RDBMS world. Embrace demoralization and data duplication. Disk space is
 cheapest, so exploit how your data is laid out in order to optimize for
 faster reads (which are more costly than writes).

 Regards,

 Rory Bramwell
 Founder and CEO
 DevOp Services

 Skype: devopservices
 Email: rory.bramw...@devopservices.com
 Web: www.devopservices.com
 On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote:

 I guess you're right, using my proposal, getting last employee's record
 is straightforward and quick, but also, as Peter pointed, getting all slips
 for a particular month requires you to know all the employee IDs and,
 ideally, run a query for each employee. This would work depending on how
 many employees you're managing.

 At this moment I'm beginning to feel that maybe using both approaches
 is the best way to go. And I think this is one of Cassandra's
 recommendations: Write your data in several formats if required to fit your
 reads. Therefore I'd use my suggestion for getting a salary by employee ID
 and I'd also have Peter's one to run the end of the month query.
 Does it make sense?

 Cheers!

 Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso

 On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is how should the app populate the duplicate data,
 i.e., if I have an employee record (along with his FN, LN,..) for the 
 month
 of Apr and later I am populating the same record for the month of may 
 (with
 salary changed), should my application first read/fetch the corresponding
 data for apr and re-insert with modification for month of may?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query
 you want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable 
 performance.



 For this type of query to be fast you can use the “month” column as
 the partition key and the “EmpID” and the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small
 number of nodes will hold all of the salary data for a specific month 
 which
 might cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right, In Cassandra you should favor denormalisation
 when in RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month integer,

   basic integer,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com

Re: Example Data Modelling

2015-07-08 Thread Saladi Naidu
If going by Month as partition key then you need to duplicate the data. I dont 
think going with name as partition key is good datamodel practice as it will 
create a hotspot. Also I believe your queries will be mostly by employee not by 
month. 
You can create employee id as partition key and month as clustering and keep 
employee details as static columns so they wont be repeated  Naidu Saladi 

  From: Srinivasa T N seen...@gmail.com
 To: user@cassandra.apache.org user@cassandra.apache.org 
 Sent: Tuesday, July 7, 2015 3:07 AM
 Subject: Re: Example Data Modelling
   
Thanks for the inputs.

Now my question is how should the app populate the duplicate data, i.e., if I 
have an employee record (along with his FN, LN,..) for the month of Apr and 
later I am populating the same record for the month of may (with salary 
changed), should my application first read/fetch the corresponding data for apr 
and re-insert with modification for month of may?

Regards,
Seenu.



On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

The data model suggested isn’t optimal for the “end of month” query you want to 
run since you are not querying by partition key.The query would look like 
“select EmpID, FN, LN, basic from salaries where month = 1” which requires 
filtering and has unpredictable performance. For this type of query to be fast 
you can use the “month” column as the partition key and the “EmpID” and the 
clustering column.This approach also has drawbacks:1. This data model creates a 
wide row. Depending on the number of employees this partition might be very 
large. You should limit partition sizes to 25MB2. Distributing data according 
to month means that only a small number of nodes will hold all of the salary 
data for a specific month which might cause hotspots on those nodes. Choose the 
approach that works best for you.  From: Carlos Alonso 
[mailto:i...@mrcalonso.com]
Sent: Monday, July 06, 2015 7:04 PM
To: user@cassandra.apache.org
Subject: Re: Example Data Modelling Hi Srinivasa, I think you're right, In 
Cassandra you should favor denormalisation when in RDBMS you find a 
relationship like this. I'd suggest a cf like thisCREATE TABLE salaries (  
EmpID varchar,  FN varchar,  LN varchar,  Phone varchar,  Address varchar,  
month integer,  basic integer,  flexible_allowance float,  PRIMARY KEY(EmpID, 
month)) That way the salaries will be partitioned by EmpID and clustered by 
month, which I guess is the natural sorting you want. Hope it helps,Cheers!
Carlos Alonso | Software Engineer | @calonso On 6 July 2015 at 13:01, Srinivasa 
T N seen...@gmail.com wrote:Hi,   I have basic doubt: I have an RDBMS with 
the following two tables:

   Emp - EmpID, FN, LN, Phone, Address
   Sal - Month, Empid, Basic, Flexible Allowance

   My use case is to print the Salary slip at the end of each month and the 
slip contains emp name and his other details.

   Now, if I want to have the same in cassandra, I will have a single cf with 
emp personal details and his salary details.  Is this the right approach?  
Should we have the employee personal details duplicated each month?

Regards,
Seenu. 



  

RE: Example Data Modelling

2015-07-07 Thread Peer, Oded
The data model suggested isn’t optimal for the “end of month” query you want to 
run since you are not querying by partition key.
The query would look like “select EmpID, FN, LN, basic from salaries where 
month = 1” which requires filtering and has unpredictable performance.

For this type of query to be fast you can use the “month” column as the 
partition key and the “EmpID” and the clustering column.
This approach also has drawbacks:
1. This data model creates a wide row. Depending on the number of employees 
this partition might be very large. You should limit partition sizes to 25MB
2. Distributing data according to month means that only a small number of nodes 
will hold all of the salary data for a specific month which might cause 
hotspots on those nodes.

Choose the approach that works best for you.


From: Carlos Alonso [mailto:i...@mrcalonso.com]
Sent: Monday, July 06, 2015 7:04 PM
To: user@cassandra.apache.org
Subject: Re: Example Data Modelling

Hi Srinivasa,

I think you're right, In Cassandra you should favor denormalisation when in 
RDBMS you find a relationship like this.

I'd suggest a cf like this
CREATE TABLE salaries (
  EmpID varchar,
  FN varchar,
  LN varchar,
  Phone varchar,
  Address varchar,
  month integer,
  basic integer,
  flexible_allowance float,
  PRIMARY KEY(EmpID, month)
)

That way the salaries will be partitioned by EmpID and clustered by month, 
which I guess is the natural sorting you want.

Hope it helps,
Cheers!

Carlos Alonso | Software Engineer | @calonsohttps://twitter.com/calonso

On 6 July 2015 at 13:01, Srinivasa T N 
seen...@gmail.commailto:seen...@gmail.com wrote:
Hi,
   I have basic doubt: I have an RDBMS with the following two tables:

   Emp - EmpID, FN, LN, Phone, Address
   Sal - Month, Empid, Basic, Flexible Allowance

   My use case is to print the Salary slip at the end of each month and the 
slip contains emp name and his other details.

   Now, if I want to have the same in cassandra, I will have a single cf with 
emp personal details and his salary details.  Is this the right approach?  
Should we have the employee personal details duplicated each month?

Regards,
Seenu.



Re: Example Data Modelling

2015-07-07 Thread Srinivasa T N
Thanks for the inputs.

Now my question is how should the app populate the duplicate data, i.e., if
I have an employee record (along with his FN, LN,..) for the month of Apr
and later I am populating the same record for the month of may (with salary
changed), should my application first read/fetch the corresponding data for
apr and re-insert with modification for month of may?

Regards,
Seenu.

On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query you
 want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries where
 month = 1” which requires filtering and has unpredictable performance.



 For this type of query to be fast you can use the “month” column as the
 partition key and the “EmpID” and the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number of
 nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right, In Cassandra you should favor denormalisation when
 in RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month integer,

   basic integer,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by month,
 which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single cf
 with emp personal details and his salary details.  Is this the right
 approach?  Should we have the employee personal details duplicated each
 month?

 Regards,
 Seenu.





Re: Example Data Modelling

2015-07-07 Thread Carlos Alonso
I guess you're right, using my proposal, getting last employee's record is
straightforward and quick, but also, as Peter pointed, getting all slips
for a particular month requires you to know all the employee IDs and,
ideally, run a query for each employee. This would work depending on how
many employees you're managing.

At this moment I'm beginning to feel that maybe using both approaches is
the best way to go. And I think this is one of Cassandra's recommendations:
Write your data in several formats if required to fit your reads. Therefore
I'd use my suggestion for getting a salary by employee ID and I'd also have
Peter's one to run the end of the month query.
Does it make sense?

Cheers!

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is how should the app populate the duplicate data, i.e.,
 if I have an employee record (along with his FN, LN,..) for the month of
 Apr and later I am populating the same record for the month of may (with
 salary changed), should my application first read/fetch the corresponding
 data for apr and re-insert with modification for month of may?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query you
 want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable performance.



 For this type of query to be fast you can use the “month” column as the
 partition key and the “EmpID” and the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number of
 nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right, In Cassandra you should favor denormalisation when
 in RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month integer,

   basic integer,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single cf
 with emp personal details and his salary details.  Is this the right
 approach?  Should we have the employee personal details duplicated each
 month?

 Regards,
 Seenu.







Re: Example Data Modelling

2015-07-07 Thread Rory Bramwell, DevOp Services
Hi,

I've been following this thread and my thoughts are inline with Carlos'
latest response... Model your data to suite your queries. That is one of
the data model / design considerations in Cassandra that differs from the
RDBMS world. Embrace demoralization and data duplication. Disk space is
cheapest, so exploit how your data is laid out in order to optimize for
faster reads (which are more costly than writes).

Regards,

Rory Bramwell
Founder and CEO
DevOp Services

Skype: devopservices
Email: rory.bramw...@devopservices.com
Web: www.devopservices.com
On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote:

 I guess you're right, using my proposal, getting last employee's record is
 straightforward and quick, but also, as Peter pointed, getting all slips
 for a particular month requires you to know all the employee IDs and,
 ideally, run a query for each employee. This would work depending on how
 many employees you're managing.

 At this moment I'm beginning to feel that maybe using both approaches is
 the best way to go. And I think this is one of Cassandra's recommendations:
 Write your data in several formats if required to fit your reads. Therefore
 I'd use my suggestion for getting a salary by employee ID and I'd also have
 Peter's one to run the end of the month query.
 Does it make sense?

 Cheers!

 Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

 On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is how should the app populate the duplicate data, i.e.,
 if I have an employee record (along with his FN, LN,..) for the month of
 Apr and later I am populating the same record for the month of may (with
 salary changed), should my application first read/fetch the corresponding
 data for apr and re-insert with modification for month of may?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query
 you want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable performance.



 For this type of query to be fast you can use the “month” column as the
 partition key and the “EmpID” and the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number
 of nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right, In Cassandra you should favor denormalisation when
 in RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month integer,

   basic integer,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single cf
 with emp personal details and his salary details.  Is this the right
 approach?  Should we have the employee personal details duplicated each
 month?

 Regards,
 Seenu.








Re: Example Data Modelling

2015-07-07 Thread Carlos Alonso
Hi Jerome,

Good point!! Really a nice usage of static columns! BTW, wouldn't the EmpID
be static as well?

Cheers

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 7 July 2015 at 14:42, Jérôme Mainaud jer...@mainaud.com wrote:

 Hello,

 You can slightly adapt Carlos answer to reduce repliation of data that
 don't change for month to month.
 Static columns are great for this.

 The table become:

 CREATE TABLE salaries (
   EmpID varchar,
   FN varchar *static*,
   LN varchar *static*,
   Phone varchar *static*,
   Address varchar *static*,
   month integer,
   basic integer,
   flexible_allowance float,
   PRIMARY KEY(EmpID, month)
 )

 There is only one copy of static column per partition the value is shared
 between all rows of the partition.
 When Employee data change you can update it with the partition key in the
 where clause.
 When you insert a new month entry you just fill non static columns.
 The table can be queried the same way as the original one.

 Cheers



 --
 Jérôme Mainaud
 jer...@mainaud.com

 2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services 
 rory.bramw...@devopservices.com:

 Hi,

 I've been following this thread and my thoughts are inline with Carlos'
 latest response... Model your data to suite your queries. That is one of
 the data model / design considerations in Cassandra that differs from the
 RDBMS world. Embrace demoralization and data duplication. Disk space is
 cheapest, so exploit how your data is laid out in order to optimize for
 faster reads (which are more costly than writes).

 Regards,

 Rory Bramwell
 Founder and CEO
 DevOp Services

 Skype: devopservices
 Email: rory.bramw...@devopservices.com
 Web: www.devopservices.com
 On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote:

 I guess you're right, using my proposal, getting last employee's record
 is straightforward and quick, but also, as Peter pointed, getting all slips
 for a particular month requires you to know all the employee IDs and,
 ideally, run a query for each employee. This would work depending on how
 many employees you're managing.

 At this moment I'm beginning to feel that maybe using both approaches is
 the best way to go. And I think this is one of Cassandra's recommendations:
 Write your data in several formats if required to fit your reads. Therefore
 I'd use my suggestion for getting a salary by employee ID and I'd also have
 Peter's one to run the end of the month query.
 Does it make sense?

 Cheers!

 Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso

 On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is how should the app populate the duplicate data,
 i.e., if I have an employee record (along with his FN, LN,..) for the month
 of Apr and later I am populating the same record for the month of may (with
 salary changed), should my application first read/fetch the corresponding
 data for apr and re-insert with modification for month of may?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query
 you want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable 
 performance.



 For this type of query to be fast you can use the “month” column as
 the partition key and the “EmpID” and the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number
 of nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right, In Cassandra you should favor denormalisation
 when in RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month integer,

   basic integer,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

Re: Example Data Modelling

2015-07-07 Thread John Sanda
25 MB seems very specific. Is there a reason why?

On Tuesday, July 7, 2015, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query you
 want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries where
 month = 1” which requires filtering and has unpredictable performance.



 For this type of query to be fast you can use the “month” column as the
 partition key and the “EmpID” and the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number of
 nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com
 javascript:_e(%7B%7D,'cvml','i...@mrcalonso.com');]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 javascript:_e(%7B%7D,'cvml','user@cassandra.apache.org');
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right, In Cassandra you should favor denormalisation when
 in RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month integer,

   basic integer,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by month,
 which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com
 javascript:_e(%7B%7D,'cvml','seen...@gmail.com'); wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single cf
 with emp personal details and his salary details.  Is this the right
 approach?  Should we have the employee personal details duplicated each
 month?

 Regards,
 Seenu.





-- 

- John


Re: Example Data Modelling

2015-07-07 Thread Jérôme Mainaud
Hello,

You can slightly adapt Carlos answer to reduce repliation of data that
don't change for month to month.
Static columns are great for this.

The table become:

CREATE TABLE salaries (
  EmpID varchar,
  FN varchar *static*,
  LN varchar *static*,
  Phone varchar *static*,
  Address varchar *static*,
  month integer,
  basic integer,
  flexible_allowance float,
  PRIMARY KEY(EmpID, month)
)

There is only one copy of static column per partition the value is shared
between all rows of the partition.
When Employee data change you can update it with the partition key in the
where clause.
When you insert a new month entry you just fill non static columns.
The table can be queried the same way as the original one.

Cheers



-- 
Jérôme Mainaud
jer...@mainaud.com

2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services 
rory.bramw...@devopservices.com:

 Hi,

 I've been following this thread and my thoughts are inline with Carlos'
 latest response... Model your data to suite your queries. That is one of
 the data model / design considerations in Cassandra that differs from the
 RDBMS world. Embrace demoralization and data duplication. Disk space is
 cheapest, so exploit how your data is laid out in order to optimize for
 faster reads (which are more costly than writes).

 Regards,

 Rory Bramwell
 Founder and CEO
 DevOp Services

 Skype: devopservices
 Email: rory.bramw...@devopservices.com
 Web: www.devopservices.com
 On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote:

 I guess you're right, using my proposal, getting last employee's record
 is straightforward and quick, but also, as Peter pointed, getting all slips
 for a particular month requires you to know all the employee IDs and,
 ideally, run a query for each employee. This would work depending on how
 many employees you're managing.

 At this moment I'm beginning to feel that maybe using both approaches is
 the best way to go. And I think this is one of Cassandra's recommendations:
 Write your data in several formats if required to fit your reads. Therefore
 I'd use my suggestion for getting a salary by employee ID and I'd also have
 Peter's one to run the end of the month query.
 Does it make sense?

 Cheers!

 Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso

 On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is how should the app populate the duplicate data, i.e.,
 if I have an employee record (along with his FN, LN,..) for the month of
 Apr and later I am populating the same record for the month of may (with
 salary changed), should my application first read/fetch the corresponding
 data for apr and re-insert with modification for month of may?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query
 you want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable 
 performance.



 For this type of query to be fast you can use the “month” column as the
 partition key and the “EmpID” and the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number
 of nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right, In Cassandra you should favor denormalisation
 when in RDBMS you find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (

   EmpID varchar,

   FN varchar,

   LN varchar,

   Phone varchar,

   Address varchar,

   month integer,

   basic integer,

   flexible_allowance float,

   PRIMARY KEY(EmpID, month)

 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single
 cf with emp personal details and his salary details.  Is this the right
 approach

Example Data Modelling

2015-07-06 Thread Srinivasa T N
Hi,

   I have basic doubt: I have an RDBMS with the following two tables:

   Emp - EmpID, FN, LN, Phone, Address
   Sal - Month, Empid, Basic, Flexible Allowance

   My use case is to print the Salary slip at the end of each month and the
slip contains emp name and his other details.

   Now, if I want to have the same in cassandra, I will have a single cf
with emp personal details and his salary details.  Is this the right
approach?  Should we have the employee personal details duplicated each
month?

Regards,
Seenu.


Re: Example Data Modelling

2015-07-06 Thread Carlos Alonso
Hi Srinivasa,

I think you're right, In Cassandra you should favor denormalisation when in
RDBMS you find a relationship like this.

I'd suggest a cf like this
CREATE TABLE salaries (
  EmpID varchar,
  FN varchar,
  LN varchar,
  Phone varchar,
  Address varchar,
  month integer,
  basic integer,
  flexible_allowance float,
  PRIMARY KEY(EmpID, month)
)

That way the salaries will be partitioned by EmpID and clustered by month,
which I guess is the natural sorting you want.

Hope it helps,
Cheers!

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

I have basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

Now, if I want to have the same in cassandra, I will have a single cf
 with emp personal details and his salary details.  Is this the right
 approach?  Should we have the employee personal details duplicated each
 month?

 Regards,
 Seenu.