Hi All,

I am new to Cassandra and am evaluating it as part of a trade study of NoSQL solutions. I have done some prototyping and read through the mailing list and blog posts, but I was hoping to get some basic suggestions about designing a scalable data model.

My example use case is storing and querying metadata about Person objects. A Person object has a flexible set of metadata such as sex, firstName, lastName, age, location, etc. I need to be able to store billions of Person objects, and I need to be able to execute queries such as "Find all people whose sex == 'male' and age >= 20 and age <= 29."

My naive first attempt had 2 ColumnFamilies:
ColumnFamily: Person
  Key: Person ID (a unique ID for each Person object)
  Columns: the metadata names and values for that Person

  Example:
  Person : {
    000-00-0000 : { sex : male, firstName : Jared, ... }
  }

ColumnFamily: Metadata
  Key: metadata name
  SuperColumns: metadata values
  SubColumns: Person IDs with that metadata value

  Example:
  Metadata : {
    sex : {
      male : {
        000-00-0000 : 000-00-0000,
        000-00-0001 : 000-00-0001,
        ...
      },
      female : { ... }
    },
    firstName : { ... }
  }

This model works great for small datasets, but I am concerned that it has big issues as the number of records grows. For example, in the Metadata ColumnFamily, under the key "sex" and the SuperColumn "male", the number of subcolumns is going to get huge: roughly 50% of the number of records in the database. Somehow I need to partition the data better.

Would one recommendation be to "split" the "sex" key into multiple keys? For example, I could append the month and year to the key ("sex_022010") to partition the data by the month it was inserted.

Any other basic recommendations from those of you with experience? Thanks a lot for any suggestions.

Jared Winick
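P.S. To pin down exactly what query semantics I am after, here is the target behavior expressed over plain Python dicts standing in for Person rows. This is illustrative only (no Cassandra API calls); the data values are made up:

```python
# Person rows keyed by Person ID, each with a flexible set of metadata.
people = {
    "000-00-0000": {"sex": "male", "firstName": "Jared", "age": 25},
    "000-00-0001": {"sex": "female", "firstName": "Ann", "age": 31},
    "000-00-0002": {"sex": "male", "firstName": "Bob", "age": 41},
}

def find(people, sex, min_age, max_age):
    """Return IDs of people matching sex and an inclusive age range,
    i.e. the query "sex == s and min_age <= age <= max_age"."""
    return sorted(
        pid for pid, person in people.items()
        if person["sex"] == sex and min_age <= person["age"] <= max_age
    )

print(find(people, "male", 20, 29))  # → ['000-00-0000']
```

The question is how to support this kind of query efficiently at billions of rows, rather than by scanning every Person.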
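P.P.S. A quick sketch of the key-splitting idea in Python, in case it helps the discussion. The helper names are mine, not any Cassandra API; the point is just that each month's inserts would land under a separate row key, and a read would then fan out across all month buckets of interest:

```python
from datetime import date

def bucketed_key(metadata_name, insert_date):
    """Compose a row key like "sex_022010": metadata name plus the
    MMYYYY of the month the record was inserted."""
    return "%s_%02d%04d" % (metadata_name, insert_date.month, insert_date.year)

def keys_for_months(metadata_name, months):
    """Row keys a reader would have to query to cover the given months."""
    return [bucketed_key(metadata_name, m) for m in months]

print(bucketed_key("sex", date(2010, 2, 15)))  # → sex_022010
print(keys_for_months("sex", [date(2010, 1, 1), date(2010, 2, 1)]))
```

This caps the size of any one row, but every query now has to hit one row per month, so I am not sure it is the right trade-off.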