On Tue, Jul 28, 2009 at 4:26 AM, Colin Mollenhour <[email protected]> wrote:
> I need to be able to fetch all or latest events with the following
> "queries":
> -A specific journal
> -All of a user's journals
> -A specific event type
> -A specific event type for a specific journal
> -A specific event type for all of a user's journals
>
> After much deliberation in trying to figure out how to do the above
> without having to loop through many many queries, here is the schema I
> arrived at:
> http://bit.ly/6Hj9I
> If I am correct in my thinking, all of the above cases can be retrieved
> in one or two steps, with the maximum number of queries being determined
> by the number of journals in question.
I think you have the right idea. And thanks for taking the trouble to
draw a diagram; that was very useful. :)

One caveat is that the subcolumns of supercolumns are not indexed. When
you query those, Cassandra reads the entire supercolumn into memory. So
they are best suited for small bunches of attributes, not up to 60k
events.

If the event names cannot clash with user names, then you might just put
all of the data / event / permissions data in the same row without extra
namespacing. Otherwise, you will have to put each of those types of data
in its own row. Which is better depends on your query needs. (My initial
impression is that the second is a better fit for you here.)

There's a related problem with your type index: Cassandra still
materializes entire rows in memory at compaction time (see
CASSANDRA-16). So for now you might want to split those across rows as
$type|$journalid, in a simple columnfamily where each row is only about
that one journal. Then you can do range queries to get the journals
needed, then slice for the events as needed.

One other suggestion: it generally simplifies things to use natural
keys rather than surrogate (_id) keys. And if you do use surrogate
keys, use UUIDs rather than numeric counters.

> Am I wrong to try to reduce the number of indexes and round-trips to the
> database by modeling this way?

No. If anything, you may not be denormalizing enough. Having CFs like
the event details off by themselves, when they don't directly need to
be queried, looks fishy.

> Some more general questions:
> My model assumes the use of get_slice_by_names with a potentially large
> number of keys, is that ok?

For the numbers you are talking about (< 100,000) it should be. Just be
aware that serialization of the request won't be negligible at those
numbers. Using get_slice with start and finish ranges will be more
efficient in that respect.
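To make the $type|$journalid idea concrete, here is a toy in-memory sketch in Python. It is not the Thrift API; the dict stands in for a columnfamily whose row keys are stored in sorted order (as with an order-preserving partitioner), and the names `range_query` and `events_for_type` are mine:

```python
from bisect import bisect_left

# Stand-in for a columnfamily with rows kept in key order.
# Row key scheme: "$type|$journalid"; columns map event ids -> payloads.
rows = {
    "comment|journal1": {"e1": "...", "e2": "..."},
    "comment|journal2": {"e3": "..."},
    "post|journal1":    {"e4": "..."},
    "post|journal9":    {"e5": "...", "e6": "..."},
}

def range_query(cf, start, finish):
    """Return row keys in [start, finish), mimicking a key-range scan."""
    keys = sorted(cf)
    return keys[bisect_left(keys, start):bisect_left(keys, finish)]

def events_for_type(cf, event_type):
    """All events of one type across every journal: one range scan over
    the keys sharing the '$type|' prefix, then a slice per row (here the
    'slice' just takes every column)."""
    result = {}
    for key in range_query(cf, event_type + "|", event_type + "|\xff"):
        result[key] = dict(cf[key])
    return result
```

The point is that one range scan replaces a fan-out of per-journal lookups; restricting to a single journal is then just a direct row fetch on "$type|$journalid".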
> Cassandra lacks transactions and increment methods, is there a way to
> generate unique user ids with just Cassandra as the authority that I am
> missing?

Yeah, UUIDs as above.

> Is it silly to use short column names for the sake of performance or
> storage efficiency? E.g. uid instead of user_id. I like verbose names...

IMO, that is unlikely to make the difference between a workable solution
and an unworkable one.

-Jonathan
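P.S. A quick sketch of the UUID approach using Python's standard uuid module (`new_user_id` is just an illustrative name):

```python
import uuid

def new_user_id():
    # Version 1 (time-based) UUIDs embed a timestamp and the generating
    # host's identity, so each client can mint unique ids on its own --
    # no central counter, no read-modify-write against the database.
    return uuid.uuid1()

a = new_user_id()
b = new_user_id()
```

Since no coordination is needed, there is nothing to make transactional; and because the timestamp is embedded, Cassandra's TimeUUID comparator can order such columns by creation time.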
