Jonathan,

Thanks for the comments.
I agree with your first point. It would be useful to plug in a user-defined index analyzer. The analyzer takes a row with the indexed columns and can extract whatever index keys it likes. This way, an application can choose what to index for different data types.

As for queries vs. the low-level API, we can make both available to the application developer. In general, what can be done in a single query may have to be translated into multiple low-level API calls. Some apps may prefer the former for efficiency.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA 95120-6099

[email protected]

Jonathan Ellis <[email protected]>
03/24/2009 10:48 AM
Please respond to cassandra-...@incubator.apache.org

To: [email protected]
cc:
Subject: Re: secondary index support in Cassandra

This adds a lot of complexity but I definitely see people wanting easy indexing out of the box. So +1 in principle.

A few high-level comments:

First, for maximum flexibility, you probably want to allow indexes to be defined in code. That is, you'd define something like

    <ColumnFamily name="foo">
        <Index generator="com.ibm.cassandra.indexGenerator"/>
    </ColumnFamily>

and allow index generators to be loaded at runtime. Nobody else is going to need the specific case of hash(rowkey):attribute1:attribute2:rowkey so abstract that out and make it pluggable for whatever weird-ass requirements people have.

Second, I'm not a fan of queries by parsing strings. The whole rdbms world has been moving _away_ from SQL and towards OO interfaces for the last 10 years. I like the thrift API for this reason. (It is a little clunky in Java, but _everything_ is a little clunky in Java. Much better in Python/Ruby/etc.)

Finally, as an implementation detail, Cassandra already does too much in-memory when writing and merging sstables. Don't make it worse. :)

-Jonathan

P.S. the partitioner abstraction layer in CASSANDRA-3 will allow you to do the per-node grouping you want without weird contortions.
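To make the pluggable analyzer idea concrete, here is a rough sketch in Python. It is not Cassandra code: the names Row, IndexAnalyzer, register and index_keys_for are hypothetical, and the in-memory registry only stands in for whatever the <Index generator="..."/> setting would load at runtime. The one thing it takes from the discussion is the contract: an analyzer receives a row with its indexed columns and returns whatever index keys it likes.

```python
from typing import Callable, Dict, List

# Hypothetical types: a row maps column names to values, and an
# analyzer turns one row into zero or more index keys.
Row = Dict[str, str]
IndexAnalyzer = Callable[[str, Row], List[str]]


def composite_analyzer(row_key: str, row: Row) -> List[str]:
    """Emit one key of the form attribute1:attribute2:rowkey."""
    return [f"{row['attribute1']}:{row['attribute2']}:{row_key}"]


def word_analyzer(row_key: str, row: Row) -> List[str]:
    """Emit one key per word of a text column (illustrative only)."""
    return [f"{word.lower()}:{row_key}" for word in row["body"].split()]


# Registry keyed by column family name; an assumption standing in
# for analyzers loaded from configuration.
ANALYZERS: Dict[str, IndexAnalyzer] = {}


def register(column_family: str, analyzer: IndexAnalyzer) -> None:
    ANALYZERS[column_family] = analyzer


def index_keys_for(column_family: str, row_key: str, row: Row) -> List[str]:
    """Return the index keys for a row, or none if no analyzer is set."""
    analyzer = ANALYZERS.get(column_family)
    return analyzer(row_key, row) if analyzer else []
```

With this shape, the write path only ever calls index_keys_for, so the hash(rowkey):attribute1:attribute2:rowkey layout becomes one analyzer among many rather than something built into the server.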
On Tue, Mar 24, 2009 at 11:21 AM, Jun Rao <[email protected]> wrote:
> To address the above problems, we are thinking of the following new
> implementation. Each entity is mapped to a row in Cassandra and uses a
> two-part key (groupID, entityID). We use the groupID to hash an entity to a
> node. This way, all entities for a group will be collocated in the same
> node. We then define a special CF to serve as the secondary index. In the
> definition, we specify what entity attributes need to be indexed and in
> what order. Within a node, this special CF will index all rows stored
> locally. Every time we insert a new entity, the server automatically
> extracts the index key based on the index definition (for example, the
> index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey")
> and adds the index entry to the special CF. We can then access the entities
> using an extended version of the query language in Cassandra. For example,
> if we issue the following query and there is an index defined by
> (attributeX, attributeY), the query can be evaluated using the index in the
> special CF. (Note that AppEngine supports this flavor of queries.)
>
>     select attributeZ
>     from ROWS(HASH = hash(groupID))
>     where attributeX="x"
>     order by attributeY desc
>     limit 50
>
> We are in the middle of prototyping this approach. We'd like to hear if
> other people are interested in this too or if people think there are better
> alternatives.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
>
> [email protected]
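As a rough illustration of how the example query could be answered from such an index, here is a toy single-node sketch in Python. It is not the prototype described above: LocalIndex and its methods are hypothetical, the index CF is modelled as a sorted in-memory list, and the group hash is passed in as a precomputed integer rather than derived from a real partitioner. Entries are kept sorted by (hash, attributeX, attributeY, rowkey), so an equality filter on attributeX becomes a contiguous range that is already ordered by attributeY.

```python
import bisect
from typing import Dict, List, Tuple


class LocalIndex:
    """Toy stand-in for the special index CF on a single node."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = {}
        # Sorted (group_hash, attributeX, attributeY, row_key) tuples.
        self.entries: List[Tuple[int, str, str, str]] = []

    def insert(self, group_hash: int, row_key: str, row: Dict[str, str]) -> None:
        """Store the row and add its index entry in sorted position."""
        self.rows[row_key] = row
        entry = (group_hash, row["attributeX"], row["attributeY"], row_key)
        bisect.insort(self.entries, entry)

    def query(self, group_hash: int, x: str, limit: int, select: str) -> List[str]:
        """select <select> where attributeX = x order by attributeY desc limit <limit>."""
        # All entries with this hash and attributeX form one contiguous range.
        lo = bisect.bisect_left(self.entries, (group_hash, x))
        hi = bisect.bisect_left(self.entries, (group_hash, x + "\x00"))
        # The range is ascending by attributeY; walk it backwards for "desc".
        matches = self.entries[lo:hi][::-1][:limit]
        return [self.rows[row_key][select] for (_, _, _, row_key) in matches]
```

For example, after inserting a few entities for one group, query(group_hash, "x", 50, "attributeZ") returns the attributeZ values of the matching rows, newest attributeY first, without scanning rows that belong to other groups or other attributeX values.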
