Re: Data Model Question

aaron morton Mon, 23 Jan 2012 01:22:55 -0800

> 1. regarding time slicing, if at any point of time I am interested in what 
> happened in the last T minutes, then I will need to query more than one row 
> of the DimentionUpdates, right?
Yerp. 
Sometimes that's is what's needed.


> 2. What did you mean by "You will also want to partition the list of actions 
> in an actiongroup"
In the ActionGroup CF in your model you have a column for every action in the 
group. You may want to partition that so the row does not grow for ever. 

> 3. For insert of a new ActionGroup, I will need to
Sounds about right. Don't forget you want to round the current time to the 
start of the current time partition. So if it's 08:38 and you have 5 minute 
partitions it will be written to the 08:35 row. 

> Is there any problem of synchronization there?
You can send them as a batch, but only the mutations for the same rows are 
atomic. And the changes are not isolated, i.e. a read can get a partially 
updated row. So your client needs to handle cases where referential integrity 
is somehow broken.

See this presentation from Matt Dennis about how you can implement an 
application level transaction log if that's something you need 
http://www.slideshare.net/mattdennis/cassandra-nyc-2011-data-modeling

> Does that makes sense? Do you think your solution is still better?
> 

You may get poor read performance with that option. Overwriting the same row 
over a long period can result in poor read performance, this is better in 1.0 
with leveled compaction but I cannot tell you how much better off the top of my 
head. Also TTL columns must be discarded during a read, and are only purged on 
compaction. A high number of dead columns in a row can reduce read performance. 
So you will want to reverse the comparator 
(http://thelastpickle.com/2011/10/03/Reverse-Comparators/) so you can avoid 
reading them. 

Cheers
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/01/2012, at 8:43 PM, Tamar Fraenkel wrote:

> Hi!
> Thank you very much for your response!
>  
> I have couple of questions regarding it, some are just to make sure I 
> understood you:
>  
> 1. regarding time slicing, if at any point of time I am interested in what 
> happened in the last T minutes, then I will need to query more than one row 
> of the DimentionUpdates, right?
> 
> 2. What did you mean by "You will also want to partition the list of actions 
> in an actiongroup"
> 
> 3. For insert of a new ActionGroup, I will need to
> 
> a) add (in some cases overwrite) columns for all the new ActionGroup 
> catetories and tags in the time_partition row of  DimensionUpdates.
> b) add to all the rows in DimentionFacts that are relevant (time, categories 
> and tags) a column for the ActionGroupId.
> c) add all the information to the ActionGroup CF and Action CF, including the 
> tags\categories as column names and their scores as values.
> Is there any problem of synchronization there?
> 
> 4. regarding the row size, wouldn't that be solved by giving TTL to the 
> columns
> 
>  
> 5. After reading a bit I also thought that built in secondary indices won't 
> work. There is also another complication, that having tagA with CategoryA is 
> different than having tagA with CategoryB.
> 
> But I still thought that something like
> 
>  
> CF: ActionGroupByCategoryTag
> Stores which action groups are tagged by each tag.
> key: <category_name:tag_name> 
> col_names: <action_group_creation_time>
> col_value: <action_group_id>
>  
> And then I will get categories*tags rows from this index table, and get 
> sliced columns by the time range i am interested in. I can define some TTL on 
> the columns to get only ActionGroups created in the last T minutes.
> 
>  
> Does that makes sense? Do you think your solution is still better?
> 
>  
> Tamar
> 
> 
> 
> On January 22, 2012 at 9:53 PM aaron morton <aa...@thelastpickle.com> wrote: 
> 
> In general if you are collecting data over time you should consider 
> partitioning the row's to avoid creating very large rows. Also if you have a 
> common request you want to support consider modeling it directly rather than 
> using secondary indexes. 
>  
> Assuming my understanding of the problem is in some one way correct I would 
> consider this for the Action Groups…
>  
> Pick a time partition, this is the minimum time resolution you are interested 
> in (your T Minutes). 
>  
> CF: DimensionUpdates
> Stores which dimensions (tags, categories) were updated in the time 
> partition. 
>  
> key: <time_partition>  is the start of the partition e.g. 2011-01-23T08:30  
> col_names: <dimension_name:dimension_value> where <dimension_name> is "tag" 
> or "category" and <dimension_value> is a value from that domain. e.g. 
> <tag:foo>
> col_value: empty
>  
>  
> CF: DimensionFacts
> Stores the facts that included the dimension in a time partition. 
>  
> key: <time_partition:dimension_name:dimension_value> definitions as above.
> col_names: ActionGroupID. 
> col_values: empty
>  
> So to…
>  
>  
> Find all the recent ActionGroups (those who were updated with actions 
> performed during the last T minutes), who has at list one of the new action’s 
> categories AND at list one of the new action’s tags. 
>  
> 1) Query the DimensionUpdates CF with the  current time partition as the key, 
> and the tags and columns the action group has. 
> 2) For each column returned from (1) query the rows in DimensionFacts to get 
> the ActionGroups.
> 3) Filter the unique set of ActionGroups client side. 
>  
>  
> Some notes:
> 1) Row size in all cases are bound to the time partition size. This will make 
> your life easier when it comes to repair and compaction. By default rows 
> sizes of 64MB will take a slower 2 pass approach that will cost you IO. 
>  
> 2) All queries are bound. Query 1 will only want request 1 to 35 columns from 
> a row that contains 0 to 35 columns. Query 2 can be done as either a multi 
> get (select with lots of KEY clauses) or a series of multi gets, and can be 
> further bound by limiting the number of columns in each request. Making 
> queries that take for a lot of rows at once can harm overall query 
> throughput.   
>  
> 3) Overwrites (writing the to the same row) are bound by the time partition. 
> Depending on load this *may* mean that rows are only physically written to 
> one SSTable. 
>  
> 4) You will also want to partition the list of actions in an actiongroup.
>  
> Hope that helps. 
>  
>  
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 20/01/2012, at 10:11 PM, Tamar Fraenkel wrote:
> 
> Hi!
> I am a newbie to Cassandra and seeking some advice regarding the data model I 
> should use to best address my needs.
> 
> For simplicity, what I want to accomplish is:
> 
> I have a system that has users (potentially ~10,000 per day) and they perform 
> actions in the system (total of ~50,000 a day).
> 
> Each User’s action is taking place in a certain point in time, and is also 
> classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s 
> Categories and Tags has a score associated with it, the score is between 0 to 
> 1 (let’s assume precision of 0.0001).
> 
> I want to be able to identify similar actions in the system (performed 
> usually by more than one user). Similarity of actions is calculated based on 
> their common Categories and Tags taking scores into account.
> 
> I need the system to store:
> 
> The list of my users with attributes like name, age etc
> For each action – the categories and tags associated with it and their score, 
> the time of the action, and the user who performed it.
> Groups of similar actions (ActionGroups) – the id’s of actions in the group, 
> the categories and tags describing the group, with their scores. Those are 
> calculated using an algorithm that takes into account the categories and tags 
> of the actions in the group.
> When a user performs a new action in the system, I want to add it to a 
> fitting ActionGroups (with similar categories and tags).
> 
> For this I need to be able to perform the following:
> 
> Find all the recent ActionGroups (those who were updated with actions 
> performed during the last T minutes), who has at list one of the new action’s 
> categories AND at list one of the new action’s tags.
> 
>  
> I thought of two ways to address the issue and I would appreciate your 
> insights.
> 
>  
> First one using secondary indexes
> 
> Column Family: Users
> 
> Key: userId
> 
> Compare with Bytes Type
> 
> Columns: name: <>, age: <> etc…
> 
>  
> Column Family: Actions
> 
> Key: actionId
> 
> Compare with Bytes Type
> 
> Columns:  Category1 : <Score> ….
> 
>           CategoriN: <Score>,
> 
>           Tag1 : <Score>, ….
> 
>           TagK:<Score>
> 
>           Time: timestamp
> 
>           user: userId
> 
>  
> Column Family: ActionGroups
> 
> Key: actionGroupId
> 
> Compare with Bytes Type
> 
> Columns: Category1 : <Score> ….
> 
>          CategoriN: <Score>,
> 
>          Tag1 : <Score> ….
> 
>          TagK:<Score>
> 
>          lastUpdateTime: timestamp
> 
>          actionId1: null, … ,
> 
>          actionIdM: null
> 
>  
> I will then define secondary index on each tag columns, category columns, and 
> the update time column.
> 
> Let’s assume the new action I want to add to ActionGroup has 
> NewActionCategory1 - NewActionCategoryK, and has NewActionTag1 – 
> NewActionTagN. I will perform the following query:
> 
> Select  * From ActionGroups where
> 
>    (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
> 
>    (NewActionTag1 > 0  … or NewActionTagN > 0) and
> 
>    lastUpdateTime > T;
> 
>  
> Second solution
> 
> Have the same CF as in the first solution without the secondary index , and 
> have two additional CF-ies:
> 
> Column Family: CategoriesToActionGroupId
> 
> Key: categoryId
> 
> Compare with ByteType
> 
> Columns: {Timestamp, ActionGroupsId1 } : null
> 
>          {Timestamp, ActionGroupsId2} : null
> 
>          ...
> 
> *timestamp is the update time for the ActionGroup
> 
>  
> A similar CF will be defined for tags.
> 
>  
> I will then be able to run several queries on CategoriesToActionGroupId (one 
> for each of the new story Categories), with column slice for the right update 
> time of the ActionGroup.
> 
> I will do the same for the TagsToActionGroupId.
> 
> I will then use my client code to remove duplicates (ActionGroups who are 
> associated with more than one Tag or Category).
> 
>  
> My questions are:
> 
> Are the two solutions viable? If yes, which is better
> Is there any better way of doing this?
> Can I use jdbc and CQL with both method, or do I have to use Hector (I am 
> using Java).
> Thanks
> 
> Tamar
> 
>  
>  
> 
>

Re: Data Model Question

Reply via email to