Re: Example Data Modelling

Saladi Naidu Wed, 08 Jul 2015 13:53:37 -0700

If you go with month as the partition key, then you need to duplicate the data.
I don't think going with name as the partition key is good data-modelling
practice, as it will create a hotspot. Also, I believe your queries will be
mostly by employee, not by month.
You can make employee id the partition key and month the clustering column, and
keep employee details as static columns so they won't be repeated.

Naidu Saladi
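
A minimal sketch of the static-column layout described above, assuming a Cassandra version that supports static columns. The table name salaries_by_employee and all values are hypothetical; the column names follow the Emp and Sal tables from the original question.

```cql
-- Hypothetical table name; columns follow the thread's Emp and Sal tables.
CREATE TABLE salaries_by_employee (
    EmpID varchar,
    month int,
    FN varchar STATIC,        -- static: stored once per EmpID partition
    LN varchar STATIC,
    Phone varchar STATIC,
    Address varchar STATIC,
    basic int,                -- regular columns: one value per (EmpID, month) row
    flexible_allowance float,
    PRIMARY KEY (EmpID, month)
);

-- Personal details are written once per employee (illustrative values).
INSERT INTO salaries_by_employee (EmpID, FN, LN, Phone, Address)
VALUES ('E100', 'Asha', 'Rao', '555-0100', '1 Main St');

-- Each month adds only the salary figures; the static columns are not re-sent.
INSERT INTO salaries_by_employee (EmpID, month, basic, flexible_allowance)
VALUES ('E100', 4, 50000, 1500.0);
INSERT INTO salaries_by_employee (EmpID, month, basic, flexible_allowance)
VALUES ('E100', 5, 52000, 1500.0);

-- Salary slip for one employee and month: the static columns come back with the row.
SELECT EmpID, FN, LN, month, basic, flexible_allowance
FROM salaries_by_employee
WHERE EmpID = 'E100' AND month = 5;
```

With this layout a slip for one employee reads a single partition. Printing slips for all employees for a given month still means one query per employee, since month is not the partition key.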

From: Srinivasa T N <seen...@gmail.com>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Sent: Tuesday, July 7, 2015 3:07 AM
Subject: Re: Example Data Modelling

Thanks for the inputs.

Now my question is how the app should populate the duplicate data, i.e., if I
have an employee record (along with his FN, LN,..) for the month of Apr and
later I am populating the same record for the month of May (with salary
changed), should my application first read/fetch the corresponding data for Apr
and re-insert it with the modification for the month of May?
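
To make the question concrete, here is a sketch of what the duplicated write looks like against the salaries table quoted further down the thread (PRIMARY KEY (EmpID, month)). All values are hypothetical. CQL's INSERT takes explicit values and has no INSERT ... SELECT form, so the application has to supply the personal details again, either from a read of an earlier row or from wherever it already holds them.

```cql
-- Hypothetical values; table and column names follow the salaries table
-- quoted later in this thread.

-- April row.
INSERT INTO salaries (EmpID, month, FN, LN, Phone, Address, basic, flexible_allowance)
VALUES ('E100', 4, 'Asha', 'Rao', '555-0100', '1 Main St', 50000, 1500.0);

-- May row: a new (EmpID, month) row, so every non-key column the slip needs
-- is supplied again, with only the salary changed.
INSERT INTO salaries (EmpID, month, FN, LN, Phone, Address, basic, flexible_allowance)
VALUES ('E100', 5, 'Asha', 'Rao', '555-0100', '1 Main St', 52000, 1500.0);
```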

Regards,
Seenu.

On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded <oded.p...@rsa.com> wrote:

The data model suggested isn’t optimal for the “end of month” query you want to
run, since you are not querying by partition key. The query would look like
“select EmpID, FN, LN, basic from salaries where month = 1”, which requires
filtering and has unpredictable performance. For this type of query to be fast,
you can use the “month” column as the partition key and “EmpID” as the
clustering column.

This approach also has drawbacks:

1. This data model creates a wide row. Depending on the number of employees,
   this partition might be very large. You should limit partition sizes to 25MB.
2. Distributing data according to month means that only a small number of nodes
   will hold all of the salary data for a specific month, which might cause
   hotspots on those nodes.

Choose the approach that works best for you.

From: Carlos Alonso [mailto:i...@mrcalonso.com]
Sent: Monday, July 06, 2015 7:04 PM
To: user@cassandra.apache.org
Subject: Re: Example Data Modelling

Hi Srinivasa,

I think you're right. In Cassandra you should favor denormalisation when in
RDBMS you find a relationship like this.

I'd suggest a cf like this:

CREATE TABLE salaries (
  EmpID varchar,
  FN varchar,
  LN varchar,
  Phone varchar,
  Address varchar,
  month int,
  basic int,
  flexible_allowance float,
  PRIMARY KEY (EmpID, month)
);

That way the salaries will be partitioned by EmpID and clustered by month,
which I guess is the natural sorting you want.

Hope it helps,
Cheers!
Carlos Alonso | Software Engineer | @calonso

On 6 July 2015 at 13:01, Srinivasa T N <seen...@gmail.com> wrote:

Hi,

   I have a basic doubt: I have an RDBMS with the following two tables:

   Emp - EmpID, FN, LN, Phone, Address
   Sal - Month, Empid, Basic, Flexible Allowance

   My use case is to print the Salary slip at the end of each month and the 
slip contains emp name and his other details.

   Now, if I want to have the same in cassandra, I will have a single cf with 
emp personal details and his salary details.  Is this the right approach?  
Should we have the employee personal details duplicated each month?

Regards,
Seenu.
