Jonathan,

Thanks for the comments.

I agree with your first point. It will be useful to plug in a user-defined
index analyzer. The analyzer takes a row with the indexed columns and can
extract whatever index keys it likes. This way, an application can choose
what to index for different data types.
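
A rough sketch of what such a pluggable analyzer could look like. The names
IndexAnalyzer, extractKeys and CompositeKeyAnalyzer are hypothetical, not
existing Cassandra API, and String.hashCode() is only a stand-in for whatever
hash function is actually used:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical plug-in point; not part of any existing Cassandra API.
interface IndexAnalyzer {
    // Given a row key and the values of its indexed columns,
    // return zero or more index keys for that row.
    List<String> extractKeys(String rowKey, Map<String, String> indexedColumns);
}

// Example analyzer that builds one composite key of the form
// hash(rowkey):attribute1:attribute2:rowkey.
class CompositeKeyAnalyzer implements IndexAnalyzer {
    private final String[] attributes;

    CompositeKeyAnalyzer(String... attributes) {
        this.attributes = attributes;
    }

    @Override
    public List<String> extractKeys(String rowKey, Map<String, String> indexedColumns) {
        List<String> keys = new ArrayList<>();
        StringBuilder key = new StringBuilder();
        // Assumption: String.hashCode() stands in for the real hash function.
        key.append(Integer.toHexString(rowKey.hashCode()));
        for (String attr : attributes) {
            String value = indexedColumns.get(attr);
            if (value == null) {
                return keys; // row lacks an indexed attribute: emit no index entry
            }
            key.append(':').append(value);
        }
        key.append(':').append(rowKey);
        keys.add(key.toString());
        return keys;
    }
}
```

An application with different data types would supply its own implementation
of the same interface and return whatever keys suit it.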

As for queries vs. the low-level API, we can make both available to the
application developer. In general, what can be done in a single query may
have to be translated into multiple low-level API calls. Some apps may prefer
the former for efficiency.
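
To illustrate the point, here is a minimal in-memory sketch of how one query
(select attributeZ where attributeX = x, ordered by attributeY descending,
with a limit) might decompose into one index scan plus one lookup per row.
Everything here is an assumption for illustration: the class name, the
TreeMap stand-ins for the index CF and data CF, and the simplification that
attribute values are plain strings without ':' that sort lexicographically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-ins for the index CF and the data CF; all names are hypothetical.
class LowLevelQuerySketch {
    // Index entries sorted by key "attributeX:attributeY:rowKey" -> rowKey.
    final TreeMap<String, String> index = new TreeMap<>();
    // rowKey -> (column name -> value).
    final Map<String, Map<String, String>> rows = new TreeMap<>();

    void insert(String rowKey, Map<String, String> columns) {
        rows.put(rowKey, columns);
        String indexKey = columns.get("attributeX") + ":" + columns.get("attributeY") + ":" + rowKey;
        index.put(indexKey, rowKey);
    }

    // Roughly: select attributeZ where attributeX = x order by attributeY desc limit n
    List<String> selectZ(String x, int limit) {
        List<String> result = new ArrayList<>();
        // Call 1: a single range scan over the index, walked in descending order.
        // ';' is the character right after ':', so [x + ":", x + ";") covers
        // exactly the keys that start with x + ":".
        Map<String, String> matches = index.subMap(x + ":", true, x + ";", false).descendingMap();
        for (String rowKey : matches.values()) {
            if (result.size() == limit) {
                break;
            }
            // Calls 2..n+1: one point lookup per matching row.
            result.add(rows.get(rowKey).get("attributeZ"));
        }
        return result;
    }
}
```

With a query interface the server can do all of this in one round trip; with
only low-level calls the client issues the scan and the lookups itself.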

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

[email protected]



                                                                       
From: Jonathan Ellis <[email protected]>
Date: 03/24/2009 10:48 AM
To: [email protected]
Subject: Re: secondary index support in Cassandra
Please respond to: cassandra-...@incubator.apache.org

This adds a lot of complexity but I definitely see people wanting easy
indexing out of the box.  So +1 in principle.

A few high-level comments:

First, for maximum flexibility, you probably want to allow indexes to
be defined in code.  That is, you'd define something like

  <ColumnFamily name="foo">
    <Index generator="com.ibm.cassandra.indexGenerator"/>
  </ColumnFamily>

and allow index generators to be loaded at runtime.  Nobody else is
going to need the specific case of
hash(rowkey):attribute1:attribute2:rowkey so abstract that out and
make it pluggable for whatever weird-ass requirements people have.
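
A minimal sketch of the runtime loading, assuming the generator attribute
names a class with a no-argument constructor. The IndexGenerator interface,
the loader class and RowKeyOnlyGenerator are hypothetical names used only for
illustration:

```java
import java.util.List;
import java.util.Map;

// Hypothetical interface that a configured generator class would implement.
interface IndexGenerator {
    List<String> generate(String rowKey, Map<String, String> columns);
}

// Trivial generator, present only to demonstrate loading by name.
class RowKeyOnlyGenerator implements IndexGenerator {
    @Override
    public List<String> generate(String rowKey, Map<String, String> columns) {
        return List.of(rowKey);
    }
}

class IndexGeneratorLoader {
    // Instantiate the class named in the generator="..." attribute.
    static IndexGenerator load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            // asSubclass rejects classes that do not implement IndexGenerator.
            return cls.asSubclass(IndexGenerator.class)
                      .getDeclaredConstructor()
                      .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load index generator " + className, e);
        }
    }
}
```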

Second, I'm not a fan of queries by parsing strings.  The whole rdbms
world has been moving _away_ from SQL and towards OO interfaces for
the last 10 years.  I like the thrift API for this reason.  (It is a
little clunky in Java, but _everything_ is a little clunky in Java.
Much better in Python/Ruby/etc.)

Finally, as an implementation detail, Cassandra already does too much
in-memory when writing and merging sstables.  Don't make it worse. :)

-Jonathan

P.S. the partitioner abstraction layer in CASSANDRA-3 will allow you
to do the per-node grouping you want without weird contortions.

On Tue, Mar 24, 2009 at 11:21 AM, Jun Rao <[email protected]> wrote:
> To address the above problems, we are thinking of the following new
> implementation. Each entity is mapped to a row in Cassandra and uses a
> two-part key (groupID, entityID). We use the groupID to hash an entity to a
> node. This way, all entities for a group will be collocated in the same
> node. We then define a special CF to serve as the secondary index. In the
> definition, we specify what entity attributes need to be indexed and in
> what order. Within a node, this special CF will index all rows stored
> locally. Every time we insert a new entity, the server automatically
> extracts the index key based on the index definition (for example, the
> index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey")
> and adds the index entry to the special CF. We can then access the entities
> using an extended version of the query language in Cassandra. For example,
> if we issue the following query and there is an index defined by
> (attributeX, attributeY), the query can be evaluated using the index in the
> special CF. (Note that AppEngine supports this flavor of queries.)
>
> select attributeZ
> from ROWS(HASH = hash(groupID))
> where attributeX="x"
> order by attributeY desc
> limit 50
>
> We are in the middle of prototyping this approach. We'd like to hear if
> other people are interested in this too or if people think there are better
> alternatives.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> [email protected]
