On Fri, Jul 31, 2009 at 5:42 PM, Colin Mollenhour<[email protected]> wrote:
> This reply keeps getting blocked as spam so I am just sending to you
> directly..
>
> Jonathan, thank you very much for the excellent response. If I may, a few
> more questions (inline):
>
> > One caveat is that the subcolumns of supercolumns are not indexed.
> > When you query those, Cassandra reads the entire Supercolumn into
> > memory. So they are best suited for small bunches of attributes, not
> > up to 60k events.
>
> Given that subcolumns of SCs are not indexed it seems that the only time it
> makes sense to use them is when some or most of the subcolumns will be
> needed within the same request, otherwise you could just have a separate
> simple CF for each sub-group of data. Is there any other reason to use a SC?

Most generally, it's useful when you want a dynamic "container", since
supercolumns can come into existence as needed but CFs are more static.

> For example on Evan Weavers blog post he gives this diagram:
> http://blog.evanweaver.com/files/cassandra/twitter.jpg with subcolumns
> user_timeline and home_timeline of the UserRelationships SC. But, because
> they will never be requested simultaneously, these would be better off if
> they were each their own simple CF, right?

That's what it looks like to me.

> > If the event names cannot clash with user names then you might just
> > put all of the data / event / permissions data in the same row without
> > extra namespacing. Otherwise, you will have to put each of those
> > types of data in a single row. Which is better depends on your query
> > needs. (My initial impression is the 2nd is a better fit for you
> > here.)
>
> I'm not sure I follow you here but the reason I had them as SC:CF is that
> pending_events is something I need to be able to add/remove from easily and
> permissions will always be retrieved as a full list. In many cases I think
> these will need to be fetched to serve the same request. What is the
> drawback of this approach that I am failing to see?

My impression was that pending_events is likely to be large, in which
case per the above it is a bad fit for a SC. Otherwise it is fine.

> > There's a related problem with your type index: Cassandra still
> > materializes entire rows in memory at compaction time (see
> > CASSANDRA-16). So for now you might want to split those across rows
> > as $type|$journalid, in a simple columnfamily with each row only about
> > that one journal. Then you can do range queries to get the journals
> > needed, then slice for the events as needed.
>
> Cool. Will it ever be possible to retrieve the actual columns from a range
> query rather than just the keys within the range?

Yes. The only question is when someone will need it enough to code it.
:)

> > One other suggestion would be that it generally simplifies things to
> > use natural keys, rather than surrogate (_id keys). And if you do use
> > surrogate keys, use UUIDs rather than numeric counters.
>
> I am having trouble finding anything on how to use UUIDs. Even a search on
> the wiki for UUID has no results and all of the examples set the id
> explicitly.. How do I do this using the Thrift interface?

Column names are byte[] now, and a UUID is just 16 bytes laid out the
right way. How you generate the UUID in the first place and serialize
it to byte[] is going to be client language dependent. (For Python,
the tests in test/system/test_server.py have an example.)

> > No. If anything, you may not be denormalizing enough. Having CFs
> > like the event details off by itself when that's not directly needing
> > to be queried looks fishy.
>
> The take-away seems to be, "Design your schema as if you are using a
> key/value hash and then group CFs together under a SC only if they are
> frequently retrieved in-full by the same app request.". Is there a point at
> which this wouldn't be true because your data was so denormalized that you
> had too many indexes, or does that just mean that Cassandra is not a good
> fit for the application?

In general, Cassandra is a poor fit where you need to do lots of
ad-hoc queries. But I don't think that's what you have here.

-Jonathan
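
To illustrate the UUID answer above (generate a UUID, then serialize it to
the 16 bytes a byte[] column name needs), here is a rough sketch using
Python's standard uuid module. It is not taken from the thread or from
test/system/test_server.py. The helper names are hypothetical, and the
choice of a time-based (version 1) UUID is an assumption; a random uuid4
serializes to 16 bytes in the same way.

```python
import uuid


def new_column_name() -> bytes:
    """Return a fresh UUID as its raw 16-byte form.

    Assumption: a time-based (version 1) UUID is wanted; swap in
    uuid.uuid4() for a random one. The name is hypothetical.
    """
    return uuid.uuid1().bytes


def column_name_to_uuid(raw: bytes) -> uuid.UUID:
    """Rebuild the UUID object from the 16 raw bytes read back."""
    return uuid.UUID(bytes=raw)


name = new_column_name()
restored = column_name_to_uuid(name)
print(len(name), restored)
```

The bytes value is what would be passed wherever the client expects a
column name; how that call looks depends on the Thrift bindings in use.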
