Re: secondary index support in Cassandra

Avinash Lakshman Tue, 24 Mar 2009 19:45:49 -0700

I think Prashant brought up some very good points. A response would be
very helpful in understanding the best way to do this.
Avinash


On Tue, Mar 24, 2009 at 6:33 PM, Jun Rao <[email protected]> wrote:

> Jonathan,
>
> Thanks for the comments.
>
> I agree with your first point. It will be useful to plug in a user-defined
> index analyzer. The analyzer takes a row with the indexed columns and can
> extract whatever index keys it likes. This way, an application can
> choose what to index for different data types.
>
> As for queries vs. the low-level API, we can make both available to the
> application developer. In general, what can be done in a single query may
> have to be translated into multiple low-level API calls. Some apps may
> prefer the former for efficiency.
>
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
>
> [email protected]
>
> Jonathan Ellis <[email protected]> wrote on 03/24/2009 10:48 AM:
>
> This adds a lot of complexity but I definitely see people wanting easy
> indexing out of the box.  So +1 in principle.
>
> A few high-level comments:
>
> First, for maximum flexibility, you probably want to allow indexes to
> be defined in code.  That is, you'd define something like
>
>  <ColumnFamily name="foo">
>    <Index generator="com.ibm.cassandra.indexGenerator"/>
>  </ColumnFamily>
>
> and allow index generators to be loaded at runtime.  Nobody else is
> going to need the specific case of
> hash(rowkey):attribute1:attribute2:rowkey so abstract that out and
> make it pluggable for whatever weird-ass requirements people have.
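
A minimal sketch of the pluggable generator idea above, written in Python for brevity. Nothing here is Cassandra API: the registry (standing in for loading a class by name at runtime), the generator names, and the `body` column are all hypothetical. The first generator reproduces the hash(rowkey):attribute1:attribute2:rowkey layout from the proposal; the second shows an unrelated indexing scheme plugging into the same slot.

```python
import hashlib
from typing import Callable, Dict, List, Mapping

# Hypothetical interface: a generator maps (row key, row columns) to index keys.
IndexGenerator = Callable[[str, Mapping[str, str]], List[str]]

# Registry standing in for loading a generator class by name at runtime.
GENERATORS: Dict[str, IndexGenerator] = {}


def register(name: str):
    """Decorator that makes a generator available under a configured name."""
    def wrap(fn: IndexGenerator) -> IndexGenerator:
        GENERATORS[name] = fn
        return fn
    return wrap


def stable_hash(value: str) -> str:
    # Any stable hash works for the sketch; SHA-256 is an arbitrary choice.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


@register("example.CompositeKeyGenerator")
def composite_key(row_key: str, columns: Mapping[str, str]) -> List[str]:
    # One index key of the form hash(rowkey):attribute1:attribute2:rowkey.
    return [":".join([stable_hash(row_key),
                      columns["attribute1"],
                      columns["attribute2"],
                      row_key])]


@register("example.WordGenerator")
def word_keys(row_key: str, columns: Mapping[str, str]) -> List[str]:
    # A different scheme: one index key per distinct word in a "body" column.
    words = {w.lower() for w in columns.get("body", "").split()}
    return sorted(w + ":" + row_key for w in words)


def index_keys(generator_name: str, row_key: str,
               columns: Mapping[str, str]) -> List[str]:
    """Look up the configured generator and apply it to one row."""
    return GENERATORS[generator_name](row_key, columns)
```

The write path only ever calls `index_keys` with whatever name the column family definition carries, so adding a new indexing scheme means registering one more function rather than changing the server.
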
>
> Second, I'm not a fan of queries by parsing strings.  The whole rdbms
> world has been moving _away_ from SQL and towards OO interfaces for
> the last 10 years.  I like the thrift API for this reason.  (It is a
> little clunky in Java, but _everything_ is a little clunky in Java.
> Much better in Python/Ruby/etc.)
>
> Finally, as an implementation detail, Cassandra already does too much
> in-memory when writing and merging sstables.  Don't make it worse. :)
>
> -Jonathan
>
> P.S. the partitioner abstraction layer in CASSANDRA-3 will allow you
> to do the per-node grouping you want without weird contortions.
>
> On Tue, Mar 24, 2009 at 11:21 AM, Jun Rao <[email protected]> wrote:
> > To address the above problems, we are thinking of the following new
> > implementation. Each entity is mapped to a row in Cassandra and uses a
> > two-part key (groupID, entityID). We use the groupID to hash an entity to
> > a node. This way, all entities for a group will be collocated on the same
> > node. We then define a special CF to serve as the secondary index. In the
> > definition, we specify which entity attributes need to be indexed and in
> > what order. Within a node, this special CF will index all rows stored
> > locally. Every time we insert a new entity, the server automatically
> > extracts the index key based on the index definition (for example, the
> > index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey")
> > and adds the index entry to the special CF. We can then access the
> > entities using an extended version of the query language in Cassandra.
> > For example, if we issue the following query and there is an index
> > defined by (attributeX, attributeY), the query can be evaluated using the
> > index in the special CF. (Note that AppEngine supports this flavor of
> > queries.)
> >
> > select attributeZ
> > from ROWS(HASH = hash(groupID))
> > where attributeX="x"
> > order by attributeY desc
> > limit 50
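
A toy model of how a node-local index on (attributeX, attributeY) could answer the query above, written in Python. This is only a sketch under stated assumptions, not the prototype's design: `LocalIndex` and its methods are hypothetical names, the group ID stands in for hash(groupID), attribute values are compared as plain strings, and each entity is inserted once (updates and deletes are not handled).

```python
import bisect
from typing import Dict, List, Tuple


class LocalIndex:
    """Toy per-node store plus a sorted index on (group, attributeX, attributeY)."""

    def __init__(self) -> None:
        # Primary rows, keyed by the two-part key (groupID, entityID).
        self.rows: Dict[Tuple[str, str], Dict[str, str]] = {}
        # Index entries kept sorted: (groupID, attributeX, attributeY, entityID).
        self.entries: List[Tuple[str, str, str, str]] = []

    def insert(self, group_id: str, entity_id: str,
               columns: Dict[str, str]) -> None:
        # Store the row, then extract and add its index entry.
        self.rows[(group_id, entity_id)] = dict(columns)
        entry = (group_id, columns["attributeX"],
                 columns["attributeY"], entity_id)
        bisect.insort(self.entries, entry)

    def query(self, group_id: str, x: str, limit: int) -> List[str]:
        """select attributeZ where attributeX = x order by attributeY desc."""
        if limit <= 0:
            return []
        # Entries sharing the (group_id, x) prefix are contiguous. A shorter
        # tuple sorts before any longer tuple it prefixes, so this is the start.
        lo = bisect.bisect_left(self.entries, (group_id, x))
        # x + "\x00" is the smallest string greater than x, so this is the end.
        hi = bisect.bisect_left(self.entries, (group_id, x + "\x00"))
        matches = self.entries[lo:hi]
        # The range is ascending by attributeY; take the tail and reverse it.
        top = reversed(matches[-limit:])
        return [self.rows[(g, e)]["attributeZ"] for (g, _, _, e) in top]


index = LocalIndex()
index.insert("g1", "e1", {"attributeX": "x", "attributeY": "2009-03-01", "attributeZ": "z1"})
index.insert("g1", "e2", {"attributeX": "x", "attributeY": "2009-03-03", "attributeZ": "z2"})
index.insert("g1", "e3", {"attributeX": "other", "attributeY": "2009-03-02", "attributeZ": "z3"})
result = index.query("g1", "x", 50)
```

Because the index entries are ordered by attributeX and then attributeY, the equality filter becomes one contiguous range and the `order by ... desc limit` clause is a read from the end of that range, with no sort at query time.
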
> >
> > We are in the middle of prototyping this approach. We'd like to hear if
> > other people are interested in this too or if people think there are
> > better alternatives.
> >
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> >
> > [email protected]
>
>
