The compaction optimization that Prashant mentioned is likely to solve many of the problems that Jun brings up.
We were thinking of tackling this problem ... I've opened a ticket in JIRA (https://issues.apache.org/jira/browse/CASSANDRA-16) Avinash, Prashant -- If you guys are already working on it, feel free to assign it to yourself. Otherwise we'll sketch out a plan and send it out, if the community agrees on the idea, we can start hacking away. Sandeep On Wed, Mar 25, 2009 at 10:50 AM, Jun Rao <jun...@almaden.ibm.com> wrote: > > Some comments inlined below. > > Jun > IBM Almaden Research Center > K55/B1, 650 Harry Road, San Jose, CA 95120-6099 > > jun...@almaden.ibm.com > > > Avinash Lakshman <avinash.laksh...@gmail.com> wrote on 03/24/2009 10:08:45 > PM: > >> Comments inline. >> >> On Tue, Mar 24, 2009 at 6:53 PM, Jun Rao <jun...@almaden.ibm.com> wrote: >> >> > >> > Prashant, >> > >> > Thanks for the comments. They are quite useful. Let me try to address > some >> > of the points that you made. >> > >> > 1. It is true that in our current implementation, we can glue the > changes >> > on both the data and the index in one batch_update() call. This way, > the >> > data and the index will be maintained synchronously. However, > maintaining >> > the index on the server is likely more efficient since there is less >> > communication overhead. You seem to agree with this. >> >> >> [Avinash] You can update multiple column families for a single key in one >> mutation. >> >> > >> > >> > 2. Cassandra currently doesn't acquire row-lock for row accesses. > However, >> > the implication is that a reader may see partial updates of a row. For >> > example, suppose that a writer updates two columns in different CFs. > Then, >> > it is possible for a concurrent reader to see the update on one column, > but >> > not the other one. For some applications, row-level consistency could > be >> > important. It's probably for this reason, in HBase, a region server >> > acquires a row lock for every read and write. >> >> >> [Avinash] Updates to a single row within a machine are atomic. Which > means >> what you are stating will not happen. Writes and reads will be serialized > at >> the Memtable. > > This problem doesn't show up in Cassandra today because there is no method > that can read columns from different CFs in a row. If there were such a > method, it would be hard to enforce that a reader always sees a complete > update (updating multiple CFs) without some sort of row locks. > >> >> > >> > >> > 3. For our current application, the size of all entities in a group is > not >> > too large and likely fits within the capacity of a single node. > However, >> > for other applications, being able to scale a group to more than a node >> > could be useful. Storing a group within a single row will prevent > scaling >> > out the group. >> >> [Avinash] I guess the question is how many entities do you envision in a >> group. What do you mean by fitting into one node? >> > > A large group may not fit in memory, but should fit in a commodity disk. > The compaction optimization Prashant mentioned will definitely make our > current approach more feasible. > > However, in general, I am a bit concerned about putting too much stuff > within a row. A row is a unit that has finite capacity and a user shouldn't > expect to put an infinite number of columns within a row. I actually like > the current assumption in Cassandra that a row has to fit in memory since > it simplifies the implementation. On the other hand, a table can have > arbitrary capacity (one just need to provision enough nodes in the cluster) > and it can have as many rows as you want. > >> > >> > >> > Jun >> > IBM Almaden Research Center >> > K55/B1, 650 Harry Road, San Jose, CA 95120-6099 >> > >> > jun...@almaden.ibm.com >> > >> > >> > Prashant Malik <pma...@gmail.com> wrote on 03/24/2009 11:34:51 AM: >> > >> > > Some questions Iline >> > > >> > > On Tue, Mar 24, 2009 at 10:21 AM, Jun Rao <jun...@almaden.ibm.com> >> > wrote: >> > > >> > > > >> > > > >> > > > We have an application that has groups and entities. A group has > many >> > > > entities and an entity has a bunch of (attribute, value) pairs. A >> > common >> > > > access pattern is to select some number of entities within a group > with >> > > > attribute X equals to x and ordered by attribute Y. For efficiency, > we >> > want >> > > > to build a secondary index for each group and collocate a group and > its >> > > > secondary index on the same node. Our current approach is to map a >> > group to >> > > > a row in Cassandra and each entity to a column in a column family > (CF). >> > > > Within the same row, we use a separate CF (ordered by name) to >> > implement a >> > > > secondary index, say on attribute X and Y. In this family, each > column >> > name >> > > > has the form of X:x:Y:y:entityID. We extended the get_slice() > function >> > so >> > > > that it can get a slice of columns starting from a given column. > The >> > > > extended function uses the column-level index to locate the > starting >> > column >> > > > quickly. (We'd be happy to contribute this extension back to > Cassandra >> > if >> > > > people find this useful). Using the extended get_slice(), we were > able >> > to >> > > > access the entities through the simulated secondary index. >> > > > >> > > > We see a couple of problems with the current approach. First, our >> > > > application has to maintain the index. This is inefficient and > could >> > leave >> > > > the index inconsistent when failure occurs. Second, mapping each >> > entity to >> > > > a column may not be a good idea. Often, there is some sort of > locking >> > for >> > > > each row access. Putting many entities per row limits concurrency. >> > Today, >> > > > in Cassandra, a full row is deserialized into memory during > compaction. >> > > > This limits the number of entities that can be put in a single row. >> > Also, >> > > > intuitively, an entity is more naturally represented as a row with >> > > > attributes stored as columns. >> > > >> > > >> > > Prashant >> > > >> > > 1. Application can send the index and the entityId update in the > same >> > > write , a write per row is always atomic given that teh index and the >> > data >> > > have teh same key in the above >> > > case the index will not be out of sync. >> > > 2. The maintainance of index by app can be moved into cassandra > and I >> > > agree with fact that you can add support for it by a built in special > CF >> > > which you have to do in either of the approaches you are taking. >> > > Infact in the first approach that you are taking it will be >> > easier >> > > to move the indexes in case of adding nodes to the cluster and when > files >> > > are split and data is sent over. In the second approach this >> > > process could get comlicated. >> > > 3. There is no locking for row access in cassandra. >> > > 4. Compactions can be opmtimized for name sorted columns , this > is >> > one >> > > of the workitems we have where we do not deserialize the entire row > in >> > > compactiopn but only do it in slices , this is easily achievable for >> > > name sorted columns. >> > > 5. The entity can still be represented naturally as a supercolumn >> > where >> > > each of the super column name is the entity Id and each of the > columns in >> > > the supercolumn are attribute value pairs. >> > > 6. How many entities per groupid are we talking about here ? > why do >> > > you feel concurrency is limited by entities per row ? >> > > >> > > >> > > > >> > > > To address the above problems, we are thinking of the following new >> > > > implementation. Each entity is mapped to a row in Cassandra and > uses a >> > > > two-part key (groupID, entityID). We use the groupID to hash an > entity >> > to a >> > > > node. This way, all entities for a group will be collocated in the > same >> > > > node. We then define a special CF to serve as the secondary index. > In >> > the >> > > > definition, we specify what entity attributes need to be indexed > and >> > in >> > > > what order. Within a node, this special CF will index all rows > stored >> > > > locally. Every time we insert a new entity, the server > automatically >> > > > extracts the index key based on the index definition (for example, > the >> > > > index key can be of the form "hash >> > (rowkey):attribute1:attribute2:rowkey) >> > > > and add the index entry to the special CF. We can then access the >> > entities >> > > > using an extended version of the query language in Cassandra. For >> > example, >> > > > if we issue the following query and there is an index defined by >> > > > (attributeX, attributeY), the query can be evaluated using the > index in >> > the >> > > > special CF. (Note that AppEngine supports this flavor of queries.) >> > > > >> > > > select attributeZ >> > > > from ROWS(HASH = hash(groupID)) >> > > > where attributeX="x" >> > > > order by attributeY desc >> > > > limit 50 >> > > > >> > > > We are in the middle of prototyping this approach. We'd like to > hear if >> > > > other people are interested in this too or if people think there > are >> > better >> > > > alternatives. >> > > > >> > > >> > > >> > > Prashant >> > > >> > > 1. In this approach I see a number of problems during addition , >> > removal >> > > and moving of new nodes. These problems are not impossible to solve > but >> > > just harder and inefficient with the above approach. >> > > 2. What are the wins of this approach over the other is not clear > to >> > me. >> > > could you please highlight those. >> > > >> > > >> > > > >> > > > Jun >> > > > IBM Almaden Research Center >> > > > K55/B1, 650 Harry Road, San Jose, CA 95120-6099 >> > > > >> > > > jun...@almaden.ibm.com >> >